Posted to issues@impala.apache.org by "Joe McDonnell (JIRA)" <ji...@apache.org> on 2018/04/11 22:18:00 UTC
[jira] [Closed] (IMPALA-6830) HdfsScanner get stale data when Hive table is overwrited

     [ https://issues.apache.org/jira/browse/IMPALA-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe McDonnell closed IMPALA-6830.
---------------------------------
    Resolution: Not A Bug

This is behavior that we expect from the file handle cache, which was enabled in Impala 2.12. 

When Hive does an insert overwrite, it often uses a deterministic naming scheme for the files, so it overwrites file X with different data. Because of file handle caching, Impala still holds an open HDFS file handle for the file named X. Impala does not know the file has changed, so it keeps using the handle it already has open, and for a while after the overwrite it continues to see the old version of the file. This is possible because an HDFS file handle can wrap a regular UNIX file handle, which keeps looking at the old OS file (the open UNIX file handle is what keeps that file around). HDFS does notice when an HDFS file is deleted or overwritten, and it will invalidate the HDFS file handle. Once that happens, Impala gets an error and then sees the new version of the file.
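
The UNIX file handle behavior described above can be reproduced on a local filesystem. The following is a minimal sketch, not Impala or HDFS code: it assumes a POSIX system, and the function name read_via_stale_handle and the file names are made up for illustration. It opens a file, replaces it with a new file of the same name, and then reads through the handle that was opened before the replacement.

```python
import os
import tempfile


def read_via_stale_handle():
    """Return (data read via the pre-overwrite handle, data read via a fresh open).

    Local-filesystem analogy only; assumes POSIX rename semantics.
    """
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "X")
        with open(path, "w") as f:
            f.write("old")

        # Stands in for the cached handle: opened before the overwrite.
        old_handle = open(path, "r")

        # Overwrite X by writing a new file and renaming it over the old name.
        tmp = os.path.join(d, "X.tmp")
        with open(tmp, "w") as f:
            f.write("new")
        os.replace(tmp, path)

        # The old handle still refers to the original file contents.
        stale = old_handle.read()
        old_handle.close()

        # A fresh open by name sees the replacement.
        with open(path, "r") as f:
            fresh = f.read()
        return stale, fresh
```

On a POSIX system the first value comes from the original file and the second from its replacement, which is the same split between the cached handle and the file name that the queries in this issue ran into.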

Refreshing the table causes Impala to notice that the file has changed (it now has a different mtime). Impala will not use a cached file handle whose mtime differs, so it opens a new HDFS file handle and sees the new data. Existing queries might finish with the old handle, but new queries will use the new handle.
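
To illustrate why a changed mtime leads to a new handle, here is a hedged sketch of a cache keyed on both file name and mtime. It is not Impala's implementation; the class FileHandleCache, its get method, and the opener callback are hypothetical names chosen for this example.

```python
class FileHandleCache:
    """Toy cache keyed by (path, mtime); a hypothetical illustration only."""

    def __init__(self, opener):
        # opener is any callable that takes a path and returns a handle.
        self._opener = opener
        self._handles = {}  # (path, mtime) -> handle

    def get(self, path, mtime):
        key = (path, mtime)
        if key not in self._handles:
            # A new mtime produces a new key, so a new handle is opened.
            # The old entry is left in place so in-flight readers can finish.
            self._handles[key] = self._opener(path)
        return self._handles[key]


# Usage: the opener here just returns a distinct placeholder object per call.
cache = FileHandleCache(opener=lambda path: object())
before_refresh = cache.get("X", 100)
same_mtime = cache.get("X", 100)      # reuses the cached handle
after_refresh = cache.get("X", 200)   # mtime changed, so a new handle is opened
```

Without a refresh, the caller keeps passing the old mtime and keeps getting the old handle, which matches the stale results shown in the report below.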

> HdfsScanner get stale data when Hive table is overwrited
> --------------------------------------------------------
>
>                 Key: IMPALA-6830
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6830
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Quanlong Huang
>            Assignee: Joe McDonnell
>            Priority: Major
>
> In the minicluster:
> {code:bash}
> hive> create table tmp_parq (a int, b string, c int) stored as parquet;
> hive> insert overwrite table tmp_parq select 1, "abc", 2;
> impala> select * from tmp_parq;
> +---+-----+---+
> | a | b   | c |
> +---+-----+---+
> | 1 | abc | 2 |
> +---+-----+---+
> hive> insert overwrite table tmp_parq select 100, "ddd", 200;
> # Impala still gets old results:
> impala> select * from tmp_parq;
> +---+-----+---+
> | a | b   | c |
> +---+-----+---+
> | 1 | abc | 2 |
> +---+-----+---+
> # It can be fixed after REFRESH:
> impala> refresh tmp_parq;
> impala> select * from tmp_parq;
> +-----+-----+-----+
> | a   | b   | c   |
> +-----+-----+-----+
> | 100 | ddd | 200 |
> +-----+-----+-----+
> {code}


