Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2019/12/09 14:25:00 UTC

[jira] [Commented] (IMPALA-8184) Add timestamp validation to Orc scanner

    [ https://issues.apache.org/jira/browse/IMPALA-8184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16991631#comment-16991631 ] 

ASF subversion and git services commented on IMPALA-8184:
---------------------------------------------------------

Commit f33a9d0d426f2cbaaf225d7ea08b15966e537f31 in impala's branch refs/heads/master from Csaba Ringhofer
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=f33a9d0 ]

IMPALA-8184: Add timestamp validation to ORC scanner

Hive can write timestamps that are outside Impala's valid
range (Impala: 1400-9999, Hive: 0001-9999). This change adds
validation logic to ORC reading that replaces out-of-range
timestamps with NULLs and adds a warning to the query.

The logic is very similar to the existing validation in
Parquet. Some differences:
- "time of day" is not checked separately as it doesn't make
  sense with ORC's encoding
- only the column id, not the column name, is added to the warning
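
The validation described above can be sketched roughly as follows. This is
an illustration in Python, not the actual Impala C++ scanner code. It
assumes each timestamp arrives as whole seconds since the Unix epoch, and
that the valid range runs from 1400-01-01 00:00:00 to 9999-12-31 23:59:59
(the commit message gives only the years). The names validate_timestamps
and to_epoch_seconds and the warning text are hypothetical.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def to_epoch_seconds(dt):
    """Convert a naive datetime (treated as UTC) to whole seconds since the epoch."""
    return (dt - EPOCH) // timedelta(seconds=1)


# Assumed bounds: the source states only the years 1400-9999.
MIN_SECONDS = to_epoch_seconds(datetime(1400, 1, 1))
MAX_SECONDS = to_epoch_seconds(datetime(9999, 12, 31, 23, 59, 59))


def validate_timestamps(column_id, values):
    """Replace out-of-range timestamps with None (NULL) and collect a warning.

    values is a list of epoch seconds, where None already means NULL.
    Returns (validated_values, warnings).
    """
    validated = []
    invalid_count = 0
    for seconds in values:
        if seconds is not None and not (MIN_SECONDS <= seconds <= MAX_SECONDS):
            # Out of range: turn it into a real NULL instead of keeping
            # a value that only looks like NULL.
            validated.append(None)
            invalid_count += 1
        else:
            validated.append(seconds)

    warnings = []
    if invalid_count:
        # Only the column id is reported, not the column name.
        warnings.append(
            f"column id {column_id}: {invalid_count} out-of-range "
            "timestamp(s) replaced with NULL"
        )
    return validated, warnings
```

With this sketch, a value such as 1200-01-01 becomes a real NULL, so a
filter like "ts is not null" no longer returns the row.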

Testing:
- added a simple EE test that scans an existing ORC file

Change-Id: I8ee2ba83a54f93d37e8832e064f2c8418b503490
Reviewed-on: http://gerrit.cloudera.org:8080/14832
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Add timestamp validation to Orc scanner
> ---------------------------------------
>
>                 Key: IMPALA-8184
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8184
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Assignee: Csaba Ringhofer
>            Priority: Critical
>
> Similarly to Parquet, Orc can also contain timestamps that are not valid in Impala, e.g. Hive can insert timestamps before 1400, which are invalid in Impala. These invalid timestamps are often handled similarly to NULL, but are actually not "real" NULLs, which can lead to some weird behavior:
> Hive:
> create table orcts (ts timestamp) stored as orc;
> insert into orcts values ("1200-01-01");
> Impala:
> select * from orcts where ts is not null;
> Returns 1 row:
> NULL



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
