Posted to issues-all@impala.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2018/11/14 20:24:01 UTC

[jira] [Commented] (IMPALA-7559) Parquet stat filtering ignores convert_legacy_hive_parquet_utc_timestamps

    [ https://issues.apache.org/jira/browse/IMPALA-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687082#comment-16687082 ] 

ASF subversion and git services commented on IMPALA-7559:
---------------------------------------------------------

Commit 60095a4c6bebc412a040d5b4a723e528ba0b2278 in impala's branch refs/heads/master from [~csringhofer]
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=60095a4 ]

IMPALA-5050: Add support to read TIMESTAMP_MILLIS and TIMESTAMP_MICROS from Parquet

Changes:
- parquet.thrift is updated to a newer version which contains the
  timestamp logical type.
- INT64 columns with converted types TIMESTAMP_MILLIS and
  TIMESTAMP_MICROS can be read as TIMESTAMP.
- If the logical type is timestamp, then the type will contain the
  information whether the UTC->local conversion is necessary. This
  feature is only supported for the new timestamp types, so INT96
  timestamps must still use the flag
  convert_legacy_hive_parquet_utc_timestamps.
- Min/max stat filtering is enabled again for columns that need
  UTC->local conversion. This was disabled in IMPALA-7559 because
  it could incorrectly drop column chunks.
- CREATE TABLE LIKE PARQUET converts these columns to
  TIMESTAMP - before the change, an error was returned instead.
- The bulk of the Parquet column stat logic was moved to a new class
  called "ColumnStatsReader".
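
To illustrate the second item in the list above: Parquet's TIMESTAMP_MILLIS and TIMESTAMP_MICROS annotate an INT64 that counts milliseconds or microseconds since the Unix epoch. The following is only a minimal Python sketch of that interpretation, not Impala's C++ implementation; the function name decode_int64_timestamp and the string unit names are hypothetical and chosen for this example.

```python
from datetime import datetime, timedelta, timezone

# Unix epoch as a timezone-aware UTC datetime.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_int64_timestamp(value: int, unit: str) -> datetime:
    """Interpret an INT64 as time elapsed since the Unix epoch.

    `unit` is a hypothetical string tag ("MILLIS" or "MICROS") standing in
    for the Parquet converted type of the column.
    """
    if unit == "MILLIS":
        return EPOCH + timedelta(milliseconds=value)
    if unit == "MICROS":
        return EPOCH + timedelta(microseconds=value)
    raise ValueError(f"unsupported unit: {unit}")


# The same instant expressed in both units decodes to the same timestamp.
millis = decode_int64_timestamp(1509237000000, "MILLIS")
micros = decode_int64_timestamp(1509237000000000, "MICROS")
print(millis.isoformat(), millis == micros)
```

The result here is a UTC instant; whether it should additionally be shifted to local time is what the logical type (or, for INT96, the startup flag) decides.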

Testing:
- Added unit tests for timezone conversion (this needed a new public
  function in timezone_db.h and adding CET to tzdb_tiny).
- Added parquet files (created with parquet-mr) with int64 timestamp
  columns.

Change-Id: I4c7c01fffa31b3d2ca3480adf6ff851137dadac3
Reviewed-on: http://gerrit.cloudera.org:8080/11057
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Parquet stat filtering ignores convert_legacy_hive_parquet_utc_timestamps
> -------------------------------------------------------------------------
>
>                 Key: IMPALA-7559
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7559
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Assignee: Csaba Ringhofer
>            Priority: Blocker
>              Labels: correctness, parquet, wrongresults
>             Fix For: Impala 3.1.0
>
>
> UPDATE: the issue turned out to be different from what I first thought; see my last comment. I will update the description with more details later.
> If the min/max value of a timestamp column chunk falls within the hour of the summer->winter DST change (UTC+2 -> UTC+1 in CET), then stat filtering can drop row groups that contain rows that satisfy the predicate.
> To reproduce (on current master branch):
> {code}
> 1. it is assumed that the timezone is CET and that flag convert_legacy_hive_parquet_utc_timestamps is enabled
> ( export TZ=CET; bin/start-impala-cluster.py --impalad_args="-convert_legacy_hive_parquet_utc_timestamps=true" )
> 2. create a table in hive and fill data in 3 inserts to create 3 files:
> create table t (i int, d timestamp) stored as parquet;
> insert into t values (1, "2017-10-29 02:30:00"), (2, "2018-10-28 02:30:00");
> insert into t values (3, "2018-10-28 02:30:00");
> insert into t values (4, "2017-10-29 02:30:00");
> 3. Query from Impala
> set num_nodes=1;
> select * from t; -- returns all 4 values (same as Hive) 
> select * from t where d = "2017-10-29 02:30:00"; -- returns 1 in Impala (Hive returns 1,4)
> select * from t where d = "2018-10-28 02:30:00"; -- returns 2 in Impala (Hive returns 2,3)
> profile; -- NumStatsFilteredRowGroups: 2 (only one row group should have been stat filtered)
> select * from t where d = "2018-10-28 02:30:00" or i = 5; -- returns 2 and 3 in Impala (same as Hive), because the "or" part disabled stat filtering
> {code}
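
The scenario in the description can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Impala's code and not necessarily the final root cause (the UPDATE note above says the issue turned out to be different from the first analysis). The CET rule is hard-coded for autumn 2017 only, and the names utc_to_cet and naive_stats_may_match are hypothetical. The point is that UTC->local conversion is not monotonic across the fall-back hour, so converting only the min and max of a row group can produce a range that excludes rows the group actually contains.

```python
from datetime import datetime, timedelta

# Assumption: simplified CET rule valid only around autumn 2017.
# At 2017-10-29 01:00 UTC the offset drops from UTC+2 to UTC+1.
FALL_BACK_UTC = datetime(2017, 10, 29, 1, 0)


def utc_to_cet(utc: datetime) -> datetime:
    """Convert a naive UTC datetime to naive CET local time (hypothetical helper)."""
    offset_hours = 2 if utc < FALL_BACK_UTC else 1
    return utc + timedelta(hours=offset_hours)


def naive_stats_may_match(utc_min: datetime, utc_max: datetime,
                          local_value: datetime) -> bool:
    """Naive filter: convert only the stored min/max, then range-check."""
    return utc_to_cet(utc_min) <= local_value <= utc_to_cet(utc_max)


# Two different UTC instants map to the same local time.
same_local = utc_to_cet(datetime(2017, 10, 29, 0, 30)) == utc_to_cet(
    datetime(2017, 10, 29, 1, 30))

# A row group whose UTC values straddle the fall-back instant.
rows_utc = [
    datetime(2017, 10, 29, 0, 50),
    datetime(2017, 10, 29, 0, 59),
    datetime(2017, 10, 29, 1, 40),
]
target_local = datetime(2017, 10, 29, 2, 59)

# Ground truth: does any row match after per-row conversion?
has_match = any(utc_to_cet(r) == target_local for r in rows_utc)
# Naive decision based on converted min/max only.
kept = naive_stats_may_match(min(rows_utc), max(rows_utc), target_local)

print(same_local, has_match, kept)
```

In this sketch the row group contains a matching row, yet the naive range check rejects it, because the converted maximum (02:40 local) is earlier than the converted minimum (02:50 local).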



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
