Posted to issues@orc.apache.org by "Panagiotis Garefalakis (Jira)" <ji...@apache.org> on 2020/03/16 16:29:00 UTC

[jira] [Commented] (ORC-611) Incorrect min-max stats for sub-millisecond timestamps

    [ https://issues.apache.org/jira/browse/ORC-611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060355#comment-17060355 ] 

Panagiotis Garefalakis commented on ORC-611:
--------------------------------------------

[~csringhofer] /  [~stigahuang] thanks for reporting this! 
Just created a PR that fixes the column stats precision issue for Timestamps – also added some tests to verify the expected behaviour.

 [~jcamachorodriguez] [~gopalv] [~ashutoshc] thoughts?

We should also add a follow-up ticket to change the Hive reader behaviour for ORC files written with the existing logic.

> Incorrect min-max stats for sub-millisecond timestamps
> ------------------------------------------------------
>
>                 Key: ORC-611
>                 URL: https://issues.apache.org/jira/browse/ORC-611
>             Project: ORC
>          Issue Type: Bug
>          Components: C++, Java
>            Reporter: Csaba Ringhofer
>            Assignee: Panagiotis Garefalakis
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The issue is related to the precision of storing timestamps:
> - nanoseconds for the data itself
> - only milliseconds for min-max statistics
> Both min and max are rounded to the same value, while min should be rounded down and max should be rounded up to ensure that the values are actually within that range.
> Repro in Hive:
> {code}
> create table tsstat (ts timestamp) stored as orc;
> insert into tsstat values ("1970-01-01 00:00:00.0005");
> select * from tsstat where ts = "1970-01-01 00:00:00.0005";
> -- returned 0 rows
> {code}
> Both the Java and the C++ writers have this issue (thanks [~stigahuang] for looking them up):
> https://github.com/apache/orc/blob/fea154436c37c81a16b13d879b510096cfaa2946/java/core/src/java/org/apache/orc/impl/writer/TimestampTreeWriter.java#L108
> https://github.com/apache/orc/blob/fea154436c37c81a16b13d879b510096cfaa2946/c%2B%2B/src/ColumnWriter.cc#L1800
> I guess that there are already files with this issue in production, so I think that the only way to fix this is to hack the reader:
> - decrease the min and increase the max stats by 1 ms after reading them from the file
> - also be careful about the values pushed down, as the same precision loss can occur there too, e.g. "WHERE ts < '1970-01-01 00:00:00.0005' AND ts > '1970-01-01 00:00:00.0004'" shouldn't be converted to ts < "1970-01-01" AND ts > "1970-01-01"
> The issue was discovered during an Impala review: https://gerrit.cloudera.org/#/c/15403/1/be/src/exec/hdfs-orc-scanner.cc@875
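
The rounding rule and the reader-side workaround described above can be sketched roughly as follows. This is only an illustration, not the ORC writer or reader code: the class and method names are hypothetical, and it assumes a timestamp is held as a single long of nanoseconds since the epoch, which is not how ORC represents it internally.

```java
// Hypothetical sketch; names and the epoch-nanos representation are assumptions.
final class TimestampStatsSketch {
    private static final long NANOS_PER_MILLI = 1_000_000L;

    // Writer side, min: largest millisecond that is <= the nanosecond value.
    static long floorMillis(long epochNanos) {
        return Math.floorDiv(epochNanos, NANOS_PER_MILLI);
    }

    // Writer side, max: smallest millisecond that is >= the nanosecond value.
    // (Ignores the Long.MIN_VALUE overflow edge case for brevity.)
    static long ceilMillis(long epochNanos) {
        return -Math.floorDiv(-epochNanos, NANOS_PER_MILLI);
    }

    // Reader side, for files written with the old logic: widen the stored
    // range by 1 ms on each side so sub-millisecond values stay inside it.
    static long[] widenLegacyRange(long minMillis, long maxMillis) {
        return new long[] {minMillis - 1, maxMillis + 1};
    }

    // True if a nanosecond value can lie within [minMillis, maxMillis].
    static boolean mayContain(long minMillis, long maxMillis, long epochNanos) {
        return epochNanos >= minMillis * NANOS_PER_MILLI
            && epochNanos <= maxMillis * NANOS_PER_MILLI;
    }

    public static void main(String[] args) {
        long ts = 500_000L; // 1970-01-01 00:00:00.0005 as nanoseconds
        System.out.println(floorMillis(ts) + " " + ceilMillis(ts));
        long[] widened = widenLegacyRange(0L, 0L);
        System.out.println(mayContain(widened[0], widened[1], ts));
    }
}
```

With the value from the repro, a range where min and max are both stored as the same millisecond cannot contain 00:00:00.0005, whereas the floor/ceil pair or the widened legacy range can. The pushed-down predicate literals would need the same care, which this sketch does not cover.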



--
This message was sent by Atlassian Jira
(v8.3.4#803005)