Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/06/16 15:34:01 UTC

[jira] [Commented] (IMPALA-3127) Decouple partitions from tables

    [ https://issues.apache.org/jira/browse/IMPALA-3127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136742#comment-17136742 ] 

ASF subversion and git services commented on IMPALA-3127:
---------------------------------------------------------

Commit 419aa2e30db326f02e9b4ec563ef7864e82df86e in impala's branch refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=419aa2e ]

IMPALA-9778: Refactor partition modifications in DDL/DMLs

After this patch, DDL/DMLs that update partition metadata no longer
update partitions in place. Instead, they always create new partitions
and use them to replace the existing instances. This is enforced by
making HdfsPartition immutable, which has several benefits:
 - HdfsPartition can be shared across table versions. In full catalog
   update mode, a catalog update can ignore unchanged partitions
   (IMPALA-3234) and send updates at partition granularity.
 - Aborted DDL/DMLs won't leave partition metadata in a bad shape (e.g.
   IMPALA-8406), which usually requires invalidation to recover.
 - Fetch-on-demand coordinators can cache partition metadata using the
   partition id as the key. When the table version changes, only the
   metadata of changed partitions needs to be reloaded (IMPALA-7533).
 - In the work of decoupling partitions from tables (IMPALA-3127), we
   don't need to assign a catalog version to partitions since the
   partition ids already identify the partitions.
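
The caching benefit above can be illustrated with a minimal sketch. It
assumes, as the text suggests, that a changed partition gets a new id,
so an id already in the cache never needs reloading. All names here
(PartitionCacheSketch, refresh, the loader function) are hypothetical
and are not Impala APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from the Impala code base.
final class PartitionCacheSketch {
  // Cached metadata (a plain String here) keyed by partition id.
  private final Map<Long, String> metaById = new HashMap<>();
  private int loadCount = 0;

  // Returns metadata for the partition ids of the current table version,
  // invoking the loader only for ids that are not cached yet.
  List<String> refresh(List<Long> currentIds, Function<Long, String> loader) {
    // Evict entries for partitions the new table version no longer references.
    metaById.keySet().retainAll(new HashSet<>(currentIds));
    List<String> result = new ArrayList<>();
    for (Long id : currentIds) {
      String meta = metaById.get(id);
      if (meta == null) {
        meta = loader.apply(id);
        metaById.put(id, meta);
        loadCount++;
      }
      result.add(meta);
    }
    return result;
  }

  int getLoadCount() { return loadCount; }

  public static void main(String[] args) {
    PartitionCacheSketch cache = new PartitionCacheSketch();
    Function<Long, String> loader = id -> "meta-" + id;
    cache.refresh(Arrays.asList(1L, 2L, 3L), loader);
    // Partition 3 was replaced by a new instance with id 4.
    cache.refresh(Arrays.asList(1L, 2L, 4L), loader);
    System.out.println(cache.getLoadCount());  // prints 4
  }
}
```

Only the replaced partition triggers a load on the second refresh; the
two unchanged ids are served from the cache.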

However, HdfsPartition is not yet strictly immutable. Although all its
fields are final, some of them still reference mutable objects. More
refactoring is needed to achieve strict immutability. This patch
focuses on refactoring the DDL/DML code paths.
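
The gap between "all fields final" and "strictly immutable" can be shown
with a small, generic Java sketch. The classes below are hypothetical
stand-ins, not HdfsPartition itself, and the defensive copy is just one
common way to close the gap, not necessarily what a follow-up patch
would do.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; these classes only illustrate the general issue.
final class ShallowImmutabilitySketch {
  // All fields are final, but the referenced list is still mutable.
  static final class ShallowPartition {
    private final long id;
    private final List<String> fileNames;

    ShallowPartition(long id, List<String> fileNames) {
      this.id = id;
      this.fileNames = fileNames;  // shares the caller's mutable list
    }

    long getId() { return id; }
    List<String> getFileNames() { return fileNames; }
  }

  // Stricter variant: defensive copy behind an unmodifiable view.
  static final class StrictPartition {
    private final long id;
    private final List<String> fileNames;

    StrictPartition(long id, List<String> fileNames) {
      this.id = id;
      this.fileNames = Collections.unmodifiableList(new ArrayList<>(fileNames));
    }

    long getId() { return id; }
    List<String> getFileNames() { return fileNames; }
  }

  public static void main(String[] args) {
    List<String> files = new ArrayList<>();
    files.add("a.parq");
    ShallowPartition shallow = new ShallowPartition(1L, files);
    StrictPartition strict = new StrictPartition(2L, files);
    files.add("b.parq");  // mutate through the original reference
    System.out.println(shallow.getFileNames().size());  // prints 2
    System.out.println(strict.getFileNames().size());   // prints 1
  }
}
```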

Changes:
 - Make all fields of HdfsPartition final. Move the HdfsPartition
   constructor logic and all its update methods into
   HdfsPartition.Builder.
 - Replace in-place updates on HdfsPartition with creating a new
   instance and dropping the old one. HdfsPartition.Builder represents
   the in-progress modifications. Once all modifications are done, call
   its build() method to create the new HdfsPartition instance. The old
   HdfsPartition instance is only replaced at the end of the
   modifications.
 - Move the "dirty" marker of HdfsPartition into a map in HdfsTable. It
   maps from the old partition id to the in-progress partition builder.
   For "dirty" partitions, we reload their HMS metadata and file
   metadata.
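
The changes above can be sketched roughly as follows. This is a
simplified illustration of the builder-and-replace pattern, not the
actual HdfsPartition or HdfsTable code: the class names, fields, id
generator, and the markDirty/commit/abort methods are all assumptions
made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; simplified stand-ins for the classes in the patch.
final class PartitionBuilderSketch {
  static final class Partition {
    private final long id;
    private final String location;
    private final long numRows;

    private Partition(long id, String location, long numRows) {
      this.id = id;
      this.location = location;
      this.numRows = numRows;
    }

    long getId() { return id; }
    String getLocation() { return location; }
    long getNumRows() { return numRows; }
  }

  // Holds in-progress modifications; build() creates a new instance.
  static final class Builder {
    private static final AtomicLong NEXT_ID = new AtomicLong();  // assumed id source
    private String location;
    private long numRows;

    Builder() {}

    Builder(Partition old) {
      this.location = old.getLocation();
      this.numRows = old.getNumRows();
    }

    Builder setLocation(String location) { this.location = location; return this; }
    Builder setNumRows(long numRows) { this.numRows = numRows; return this; }

    // Assumption: every built partition gets a fresh id.
    Partition build() { return new Partition(NEXT_ID.getAndIncrement(), location, numRows); }
  }

  static final class Table {
    private final Map<Long, Partition> partitions = new HashMap<>();
    // "Dirty" marker: old partition id -> in-progress builder.
    private final Map<Long, Builder> dirtyPartitions = new HashMap<>();

    void addPartition(Partition p) { partitions.put(p.getId(), p); }
    Partition getPartition(long id) { return partitions.get(id); }
    int size() { return partitions.size(); }

    Builder markDirty(long partitionId) {
      Partition old = partitions.get(partitionId);
      if (old == null) throw new IllegalArgumentException("Unknown partition: " + partitionId);
      return dirtyPartitions.computeIfAbsent(partitionId, id -> new Builder(old));
    }

    // Old instances are replaced only here, after all modifications are done.
    void commitDirtyPartitions() {
      for (Map.Entry<Long, Builder> e : dirtyPartitions.entrySet()) {
        Partition replacement = e.getValue().build();
        partitions.remove(e.getKey());
        partitions.put(replacement.getId(), replacement);
      }
      dirtyPartitions.clear();
    }

    // An aborted operation discards the builders; existing partitions are untouched.
    void abortDirtyPartitions() { dirtyPartitions.clear(); }
  }

  public static void main(String[] args) {
    Table table = new Table();
    Partition p = new Builder().setLocation("/warehouse/t/p=1").setNumRows(10).build();
    table.addPartition(p);
    table.markDirty(p.getId()).setNumRows(20);
    System.out.println(table.getPartition(p.getId()).getNumRows());  // prints 10
    table.commitDirtyPartitions();
    System.out.println(table.getPartition(p.getId()));  // prints null
  }
}
```

Because the table only swaps instances at commit time, aborting in the
middle of a modification leaves the previously published partitions
exactly as they were.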

Tests:
 - No new tests are added since the existing tests already provide
   sufficient coverage.
 - Ran CORE tests.

Change-Id: Ib52e5810d01d5e0c910daacb9c98977426d3914c
Reviewed-on: http://gerrit.cloudera.org:8080/15985
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Decouple partitions from tables
> -------------------------------
>
>                 Key: IMPALA-3127
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3127
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Catalog
>    Affects Versions: Impala 2.2.4
>            Reporter: Dimitris Tsirogiannis
>            Assignee: Vihang Karajgaonkar
>            Priority: Major
>              Labels: catalog-server, performance
>
> Currently, partitions are tightly integrated into the HdfsTable objects, making incremental metadata updates difficult to perform. Furthermore, the catalog transmits the entire table metadata even when only a few partitions change, introducing significant latency and wasting network bandwidth and CPU cycles while table metadata is updated at the receiving impalads. As a first step, we should decouple partitions from tables and add them as a separate level in the hierarchy of catalog entities (server-db-table-partition). Subsequently, the catalog should transmit only the entities that have changed after DDL/DML statements.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
