Posted to issues@impala.apache.org by "Sourabh Goyal (Jira)" <ji...@apache.org> on 2021/12/07 13:38:00 UTC

[jira] [Created] (IMPALA-11050) Skip file metadata reloading in AlterPartition event from event processor in catalogd

Sourabh Goyal created IMPALA-11050:
--------------------------------------

             Summary: Skip file metadata reloading in AlterPartition event from event processor in catalogd
                 Key: IMPALA-11050
                 URL: https://issues.apache.org/jira/browse/IMPALA-11050
             Project: IMPALA
          Issue Type: Improvement
          Components: Catalog
            Reporter: Sourabh Goyal
            Assignee: Sourabh Goyal


An HdfsPartition in catalogd is a collection of files, and each file is represented by a FileDescriptor. A FileDescriptor (fd) contains:

1. RelativePath of the file
2. Length of the file
3. Compression info, such as GZIP
4. Modification time of the file
5. Info on the blocks that belong to the file. Each block has an offset, a length, and diskIds
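
As a rough illustration of the five fields above, a file descriptor could be modeled as below. This is a simplified sketch and not Impala's actual FileDescriptor class: the names FileDescriptorSketch, BlockSketch, and totalBlockBytes, and the string-valued compression field, are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified model of one block of a file (not an Impala class).
final class BlockSketch {
  final long offset;    // byte offset of the block within the file
  final long length;    // block length in bytes
  final int[] diskIds;  // ids of the disks holding the block replicas

  BlockSketch(long offset, long length, int[] diskIds) {
    this.offset = offset;
    this.length = length;
    this.diskIds = diskIds.clone();
  }
}

// Hypothetical, simplified model of the per-file metadata listed above.
final class FileDescriptorSketch {
  final String relativePath;       // path relative to the partition location
  final long length;               // file length in bytes
  final String compression;        // e.g. "GZIP" or "NONE" (assumed encoding)
  final long modificationTime;     // last modification time, epoch millis
  final List<BlockSketch> blocks;  // block info for this file

  FileDescriptorSketch(String relativePath, long length, String compression,
      long modificationTime, List<BlockSketch> blocks) {
    this.relativePath = relativePath;
    this.length = length;
    this.compression = compression;
    this.modificationTime = modificationTime;
    this.blocks = Collections.unmodifiableList(new ArrayList<>(blocks));
  }

  // Sum of the block lengths recorded for this file.
  long totalBlockBytes() {
    long total = 0;
    for (BlockSketch b : blocks) {
      total += b.length;
    }
    return total;
  }
}
```

Every field here comes from the filesystem rather than from HMS, which is why rebuilding these descriptors requires a file listing.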

When the event processor processes an AlterPartitionEvent, it currently reloads the partition, including its file metadata. Reloading file metadata is relatively expensive because it involves listing files in the underlying filesystem. From the Impala shell, an alter partition is triggered via ALTER TABLE <table_name> PARTITION <partition_spec> <operation>, where the operation can be:
1. Update stats
2. Drop stats
3. Set file format
4. Set row format
5. Set table properties
6. Unset table properties
7. Set serde properties
8. Unset serde properties
9. Set cached <hdfs-pool-name>
10. Unset cached <hdfs-pool-name>
11. Set location

 

*For transactional tables:*

For transactional tables, if incremental refresh is enabled, the event processor reloads file metadata at the CommitTxn event. Since there is no way to know whether the CommitTxn event was caused by alter_partition or by some other operation, file metadata reloading cannot be skipped.

*For external tables:* 

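Of the operations above, any operation that affects the underlying storage descriptor of a partition should trigger file metadata reloading. Operations 3, 4, 7, 8, and 11 (set file format, set row format, set serde properties, unset serde properties, and set location) are such operations.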
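
The classification above can be restated as a small lookup. This is only a sketch of the rule as described in this issue: the enum AlterPartitionOp, the class ReloadPolicySketch, and the method needsFileMetadataReload are hypothetical names, not existing Impala code.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical enum mirroring the eleven operations listed above, in order.
enum AlterPartitionOp {
  UPDATE_STATS,
  DROP_STATS,
  SET_FILE_FORMAT,
  SET_ROW_FORMAT,
  SET_TBL_PROPERTIES,
  UNSET_TBL_PROPERTIES,
  SET_SERDE_PROPERTIES,
  UNSET_SERDE_PROPERTIES,
  SET_CACHED,
  UNSET_CACHED,
  SET_LOCATION
}

final class ReloadPolicySketch {
  // Operations 3, 4, 7, 8 and 11: the ones this issue says affect the
  // storage descriptor of the partition.
  static final Set<AlterPartitionOp> STORAGE_DESCRIPTOR_OPS = EnumSet.of(
      AlterPartitionOp.SET_FILE_FORMAT,
      AlterPartitionOp.SET_ROW_FORMAT,
      AlterPartitionOp.SET_SERDE_PROPERTIES,
      AlterPartitionOp.UNSET_SERDE_PROPERTIES,
      AlterPartitionOp.SET_LOCATION);

  // For external tables: reload file metadata only for operations that
  // touch the storage descriptor.
  static boolean needsFileMetadataReload(AlterPartitionOp op) {
    return STORAGE_DESCRIPTOR_OPS.contains(op);
  }
}
```

For example, ReloadPolicySketch.needsFileMetadataReload(AlterPartitionOp.UPDATE_STATS) is false, so under this rule a stats-only change would not need a file listing.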

 

*How to detect change in file descriptor of a partition:*

The HMS partition object received in an alter_partition event contains a metastore.api.StorageDescriptor object. This object has fields such as:
 * List<FieldSchema> cols
 * String location 
 * String inputFormat
 * String outputFormat
 * boolean compressed
 * int numBuckets
 * SerDeInfo serdeInfo
 * List<String> bucketCols
 * List<Order> sortCols
 * Map<String, String> parameters

 

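Fetch the HMS partition object from the alterPartition event and compare its storage descriptor properties with the corresponding properties of the already cached partition object. If any of them differ, reload the file metadata; otherwise the reload can be skipped.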
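
A minimal sketch of that comparison is shown below. To stay self-contained it does not use the real Hive metastore classes: SdSnapshot is a hypothetical stand-in holding only a subset of the storage descriptor fields (the serde is reduced to a library name and a parameter map), and SdDiffSketch.storageChanged is an assumed helper name, not an existing Impala method.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for a subset of the storage descriptor fields.
final class SdSnapshot {
  final String location;
  final String inputFormat;
  final String outputFormat;
  final boolean compressed;
  final String serializationLib;          // assumed: serde class name
  final Map<String, String> serdeParams;  // assumed: serde properties

  SdSnapshot(String location, String inputFormat, String outputFormat,
      boolean compressed, String serializationLib,
      Map<String, String> serdeParams) {
    this.location = location;
    this.inputFormat = inputFormat;
    this.outputFormat = outputFormat;
    this.compressed = compressed;
    this.serializationLib = serializationLib;
    this.serdeParams = new HashMap<>(serdeParams);
  }
}

final class SdDiffSketch {
  // Returns true when any compared field differs between the cached
  // partition and the partition carried by the event, i.e. when the
  // file metadata should be reloaded under the rule described above.
  static boolean storageChanged(SdSnapshot cached, SdSnapshot fromEvent) {
    boolean same =
        Objects.equals(cached.location, fromEvent.location)
        && Objects.equals(cached.inputFormat, fromEvent.inputFormat)
        && Objects.equals(cached.outputFormat, fromEvent.outputFormat)
        && cached.compressed == fromEvent.compressed
        && Objects.equals(cached.serializationLib, fromEvent.serializationLib)
        && Objects.equals(cached.serdeParams, fromEvent.serdeParams);
    return !same;
  }
}
```

Objects.equals is used so that null fields compare safely. A real implementation would also need to decide which of the remaining fields (cols, numBuckets, bucketCols, sortCols, parameters) belong in the comparison.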

*Unknowns:*
 # If a partition is cached in HDFS, should we always reload its file metadata (irrespective of which of the operations mentioned above was performed) to get the most up-to-date block locations?

 

cc - [~vihangk1] [~stigahuang] [~hsnusonic] 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)