Posted to issues-all@impala.apache.org by "Tim Armstrong (Jira)" <ji...@apache.org> on 2020/08/13 21:39:00 UTC

[jira] [Commented] (IMPALA-3976) Handle partition-key values with multiple synonymous string representations created in Hive.

    [ https://issues.apache.org/jira/browse/IMPALA-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177332#comment-17177332 ] 

Tim Armstrong commented on IMPALA-3976:
---------------------------------------

This particular repro is now rejected.
{noformat}
Caused by: org.apache.hadoop.hive.metastore.api.AlreadyExistsException: Partition already exists
{noformat}

> Handle partition-key values with multiple synonymous string representations created in Hive.
> --------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-3976
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3976
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>    Affects Versions: Impala 2.3.0, Impala 2.5.0, Impala 2.4.0, Impala 2.6.0, Impala 2.7.0
>            Reporter: Alexander Behm
>            Priority: Major
>              Labels: correctness, incompatibility
>
> For several SQL statements that can create new partitions, Hive seems to generate partition-key values and the corresponding HDFS directory based on the user's string input rather than the corresponding literal value of the appropriate column type. This leads to a situation where a single logical partition-key value can map to multiple HDFS directories and Hive partitions.
> Example in Hive:
> {code}
> CREATE TABLE t (i INT) PARTITIONED BY (p INT);
> ALTER TABLE t ADD PARTITION (p=0);
> ALTER TABLE t ADD PARTITION (p=00);
> ALTER TABLE t ADD PARTITION (p=000);
> SHOW PARTITIONS t;
> p=0
> p=00
> p=000
> {code}
> The above statements will result in three different HDFS directories, one for each of the "distinct" partitions.
> The same result can be achieved with static partition inserts from Hive, instead of ALTER TABLE ADD PARTITION.
> Note that Impala will use a canonical representation for any partition-key value based on the underlying LiteralExpr, so a similarly strange metadata state cannot be created from Impala, even if given the same input as in the example above.
> A special case of this issue was reported in HIVE-6590 and IMPALA-3963, but the underlying problem is more general.
> *Issues in Impala*
> Impala has difficulties dealing with such ambiguous partitions due to the internal assumption that a single assignment of values to partition keys maps to a single Hive partition with one corresponding HDFS directory.
> As long as the cached partition metadata in Impala is correct, queries will return correct results even with partition filters. Impala effectively coalesces the different partition variants. For example, SELECT * FROM t WHERE p=0 will scan all three directories from the example above.
> The following statements are known to have problems in Impala if such ambiguous partitions exist:
> * REFRESH <table> and REFRESH <partition>. After such a statement Impala may have duplicate and/or missing partitions, leading to incorrect query results.
> * ALTER TABLE RECOVER PARTITIONS, same as REFRESH above.
> * ALTER TABLE <table> DROP PARTITION. Impala will only be able to drop the one partition with the canonical value representation. Other variants of the same partition cannot be dropped.
> * Any other ALTER TABLE ... PARTITION(). Impala will only modify the one partition with the canonical value representation (if any).
> * It is safest to assume that all other metadata statements that operate on a single partition are likewise not functioning as intended.
> *Workarounds*
> * Ensure that partitions created via Hive do not exhibit ambiguity. Stick to a single partition-key value representation, e.g., use p=0 consistently and avoid variants like p=000.
> * Avoid those statements in Hive that can create the bad metadata. Always use fully dynamic partition inserts and avoid adding partitions via static partition inserts or ALTER TABLE.
> * Running INVALIDATE METADATA <table> will bring Impala's metadata back into a consistent state (including all partition variants). Queries will return correct results, but some DDL operations may still not fully work (like DROP PARTITION).
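
As a rough illustration of the first workaround, the sketch below checks a list of partition names (in the key=value form printed by SHOW PARTITIONS) for spellings that denote the same integer value. It assumes a single INT partition column. The function name find_ambiguous_int_partitions is hypothetical and is not part of Hive or Impala, and Python's int() parsing only stands in for a canonical representation; it is not Impala's LiteralExpr logic.

```python
from collections import defaultdict


def find_ambiguous_int_partitions(partition_names):
    """Group 'key=value' partition names by the integer value they denote.

    Hypothetical helper: returns a dict mapping (key, int_value) to the
    sorted list of raw names, keeping only groups with more than one spelling.
    """
    groups = defaultdict(list)
    for name in partition_names:
        key, _, raw_value = name.partition("=")
        # int() accepts leading zeros, so "0", "00" and "000" all map to 0.
        groups[(key, int(raw_value))].append(name)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


if __name__ == "__main__":
    # Partition names copied from the SHOW PARTITIONS output in the example,
    # plus one unambiguous partition (p=1) added for contrast.
    listed = ["p=0", "p=00", "p=000", "p=1"]
    for (key, value), names in find_ambiguous_int_partitions(listed).items():
        print(f"{key}={value} has {len(names)} spellings: {', '.join(names)}")
```

Any group this reports corresponds to one logical partition-key value that is backed by several Hive partitions and HDFS directories.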
