Posted to issues@hive.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2018/06/11 18:02:00 UTC

[jira] [Comment Edited] (HIVE-19830) Inconsistent behavior when multiple partitions point to the same location

    [ https://issues.apache.org/jira/browse/HIVE-19830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16508443#comment-16508443 ] 

Sergey Shelukhin edited comment on HIVE-19830 at 6/11/18 6:01 PM:
------------------------------------------------------------------

Hive metadata is the source of truth for managed tables. In fact, Hive has recently been moving even further in that direction: with ACID, what is on disk doesn't matter for metadata at all and only the ACID state in the metastore matters; with stats-based queries, Hive assumes for managed tables that all data changes are made through Hive and can be accounted for to update, or at least invalidate, the stats.
There are features that cover various scenarios like the ones you mention (e.g. views, materialized views, or external tables, where Hive makes no assumptions about the data).
Pointing partitions to the same directory is simply not supported, and it generally works only by accident, i.e. as long as deletes are not involved. It could perhaps be supported, but it would need to be done as an explicit feature (symlink partitions?). Even comparing the JIRA description to the "latest" partition use case shows a semantic discrepancy: in the j=1 and j=2 example, you want the same directory to be read twice for sum(), but if you have a "latest" partition you wouldn't want it double counted with the actual date partition. So for this to work, there need to be explicit semantics that Hive is aware of.
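
The discrepancy between the two use cases can be illustrated with a small toy model. This is only a sketch in plain Python, not Hive code: the dictionaries and the function names scan_per_partition and scan_per_location are hypothetical, and the paths are shortened stand-ins for the locations in the description.

```python
# Toy model only: not Hive code. All names here are hypothetical.
# Partition metadata maps a partition value (j) to a directory.
partitions = {1: "/test/j=1", 2: "/test/j=1"}  # both partitions share one location
files = {"/test/j=1": [1]}                      # the directory holds a single row, i = 1


def scan_per_partition(partitions, files):
    """Read each partition's directory once per partition (the j=1 / j=2 expectation)."""
    return [(i, j) for j, loc in sorted(partitions.items()) for i in files.get(loc, [])]


def scan_per_location(partitions, files):
    """Read each distinct directory only once (what a "latest" alias would want)."""
    seen, rows = set(), []
    for j, loc in sorted(partitions.items()):
        if loc in seen:
            continue  # this directory was already counted under another partition
        seen.add(loc)
        rows.extend((i, j) for i in files.get(loc, []))
    return rows


rows_a = scan_per_partition(partitions, files)
rows_b = scan_per_location(partitions, files)
print("per partition:", sum(i for i, _ in rows_a), sum(j for _, j in rows_a))
print("per location: ", sum(i for i, _ in rows_b), sum(j for _, j in rows_b))
```

Both readings are defensible for some use case, which is why the metadata alone cannot tell an engine which one is intended.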

cc [~ashutoshc] for more context.



> Inconsistent behavior when multiple partitions point to the same location
> -------------------------------------------------------------------------
>
>                 Key: HIVE-19830
>                 URL: https://issues.apache.org/jira/browse/HIVE-19830
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 2.4.0
>            Reporter: Gabor Kaszab
>            Assignee: Adam Szita
>            Priority: Major
>
> // Create a table with 2 partitions that share the same location, then insert a single row into one of them.
> create table test (i int) partitioned by (j int) stored as parquet;
> alter table test add partition (j=1) location 'hdfs://localhost:20500/test-warehouse/test/j=1';
> alter table test add partition (j=2) location 'hdfs://localhost:20500/test-warehouse/test/j=1';
> insert into table test partition (j=1) values (1);
> // select * shows this single row in both partitions, as expected.
> select * from test;
> 1 1
> 1 2
> // However, sum() doesn't add up the row across all the partitions. This is +Issue #1+.
> select sum(i), sum(j) from test;
> 1 2
> // On the file system there is a common dir for the 2 partitions, which is expected.
> hdfs dfs -ls hdfs://localhost:20500/test-warehouse/test/
> Found 1 items
> drwxr-xr-x - gaborkaszab supergroup 0 2018-06-08 10:54 hdfs://localhost:20500/test-warehouse/test/j=1
> // Let's drop one of the partitions now!
> alter table test drop partition (j=2);
> // Running the same hdfs dfs -ls command shows that the j=1 directory has been dropped. I think this is good behavior; we just have to document that this is the expected case.
> // select * from test; returns zero rows, which is still as expected.
> // Even though the dir is dropped, the j=1 partition is still visible with show partitions. This is +Issue #2+.
> show partitions test;
> j=1
> After the directory is dropped through Hive, when Impala reloads its partitions it asks Hive which partitions exist. Apparently, Hive sends down a list that includes the j=1 partition; Impala then treats it as an existing partition and doesn't drop it from the Catalog's cache. Hive shouldn't send that partition down here. This is +Issue #3+.
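
The drop-partition sequence behind Issue #2 and Issue #3 can be reproduced in a toy model. This is a sketch under stated assumptions, not Hive or Impala code: ToyMetastore and all of its methods are hypothetical, and the only behavior it borrows from the description is that dropping a partition also removes that partition's directory.

```python
# Toy model only: not Hive or Impala code. ToyMetastore is a hypothetical class.
class ToyMetastore:
    def __init__(self):
        self.partitions = {}      # partition value (j) -> directory path
        self.directories = set()  # directories that exist on the toy file system

    def add_partition(self, value, location):
        self.partitions[value] = location
        self.directories.add(location)

    def drop_partition(self, value):
        # Behavior taken from the description: the drop removes the directory too,
        # without checking whether another partition still points at it.
        location = self.partitions.pop(value)
        self.directories.discard(location)

    def show_partitions(self):
        return sorted(self.partitions)

    def dangling_partitions(self):
        """Partitions still listed in metadata whose directory no longer exists."""
        return sorted(v for v, loc in self.partitions.items()
                      if loc not in self.directories)


ms = ToyMetastore()
ms.add_partition(1, "/test/j=1")
ms.add_partition(2, "/test/j=1")  # same location as j=1
ms.drop_partition(2)

listed = ms.show_partitions()
dangling = ms.dangling_partitions()
print("show partitions:", listed)
print("dangling:", dangling)
```

In this model j=1 remains listed after its directory is gone, which mirrors what show partitions returns above and what a client that asks the metastore for the partition list would receive.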



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)