Posted to issues-all@impala.apache.org by "Csaba Ringhofer (JIRA)" <ji...@apache.org> on 2018/09/13 14:34:00 UTC

[jira] [Closed] (IMPALA-7567) Implement timezone aware parquet stat filtering for timestamp columns

     [ https://issues.apache.org/jira/browse/IMPALA-7567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Csaba Ringhofer closed IMPALA-7567.
-----------------------------------
    Resolution: Duplicate

Created by mistake.

> Implement timezone aware parquet stat filtering for timestamp columns
> ---------------------------------------------------------------------
>
>                 Key: IMPALA-7567
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7567
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Csaba Ringhofer
>            Priority: Major
>              Labels: parquet, timestamp
>
> Parquet timestamp columns can contain UTC-normalized data, which means that the data is stored in UTC but is expected to be shown in local time (to be consistent with Hive). This is done by converting these timestamps from UTC to local time during scanning.
> This conversion has to be considered during min/max stat filtering; otherwise some row groups can be incorrectly skipped. For this reason, IMPALA-7559 disables stat filtering on UTC-normalized timestamp columns.
> This ticket deals with creating a correct implementation so that stat filtering can be re-enabled for these columns.
> DST and historical rule changes add some complexity to this. The UTC->local mapping can be non-monotonic, and the local->UTC mapping can be ambiguous. Non-monotonicity means that tMin <= t <= tMax being true in UTC does not imply that the same is true in local time.
> The solution I see is to convert the min/max of the predicate from local to UTC and resolve ambiguity by choosing the earlier time in the case of min and the later time in the case of max. These UTC values can be compared with stats safely.
> Note that the timezone rules can be different in Hive and Impala (especially historical ones), so we cannot ensure that Impala gives exactly the same results as Hive. The goal is to ensure that Impala returns the same rows with and without stat filtering.
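
The proposed conversion can be illustrated with a minimal Python sketch. This is not Impala code: the function names (local_to_utc_candidates, predicate_bounds_in_utc, can_skip_row_group) are hypothetical, and the standard-library zoneinfo module (which needs the system tz database or the tzdata package) stands in for Impala's own timezone handling. The sketch maps a local wall-clock time to UTC under both DST interpretations, takes the earlier instant for the predicate's lower bound and the later instant for its upper bound, and compares the result with UTC row-group stats. Applying the same min/max rule to nonexistent local times (the spring-forward gap) is an assumption of this sketch, not something the ticket specifies.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9; needs tz data installed


def local_to_utc_candidates(local_naive, tz):
    """Return the UTC instants a naive local wall-clock time may denote.

    fold=0 and fold=1 select the two possible UTC offsets around a
    transition; for unambiguous times both candidates are identical.
    """
    return [
        local_naive.replace(tzinfo=tz, fold=fold).astimezone(timezone.utc)
        for fold in (0, 1)
    ]


def predicate_bounds_in_utc(lo_local, hi_local, tz):
    """Convert a local-time range predicate to conservative UTC bounds.

    The lower bound takes the earlier candidate and the upper bound the
    later one, so the UTC range never excludes a matching row.
    """
    lo_utc = min(local_to_utc_candidates(lo_local, tz))
    hi_utc = max(local_to_utc_candidates(hi_local, tz))
    return lo_utc, hi_utc


def can_skip_row_group(stat_min_utc, stat_max_utc, lo_utc, hi_utc):
    """A row group can be skipped only if its UTC stats miss the UTC range."""
    return stat_max_utc < lo_utc or stat_min_utc > hi_utc


if __name__ == "__main__":
    tz = ZoneInfo("America/New_York")
    # 01:30 local on 2018-11-04 occurs twice (DST ends that night).
    ambiguous = datetime(2018, 11, 4, 1, 30)
    lo_utc, hi_utc = predicate_bounds_in_utc(ambiguous, ambiguous, tz)
    print(lo_utc.isoformat(), hi_utc.isoformat())
```

For the ambiguous 01:30 local time in the example, the two bounds differ by one hour, so a row group whose UTC stats fall anywhere between the two instants is kept rather than skipped. The bounds are deliberately conservative: they may keep row groups that contain no matching rows, but the scanner still evaluates the predicate on the converted values, so results stay the same with and without stat filtering.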



