You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@orc.apache.org by "Gang Wu (JIRA)" <ji...@apache.org> on 2019/01/13 21:23:00 UTC

[jira] [Commented] (ORC-451) Timestamp statistics is wrong if read with useUTCTimestamp=true

    [ https://issues.apache.org/jira/browse/ORC-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16741658#comment-16741658 ] 

Gang Wu commented on ORC-451:
-----------------------------

I have checked the attached file using different ORC tools. The writer timezone in stripe footer is "UTC" and C++ orc-statistics tool shows the following statistics which is correct:
{quote}--- Column 1 ---
Data type: Timestamp
Values: 1
Has null: no
Minimum: 2018-12-24 18:30:00.0
LowerBound: 2018-12-24 18:30:00.0
Maximum: 2018-12-24 18:30:00.0
UpperBound: 2018-12-24 18:30:00.1
{quote}
However, when I used Java tool to print its statistics, I got the following result:
{quote}$ java -jar orc-tools-1.6.0-SNAPSHOT-uber.jar meta ~/Downloads/000000_0

Stripe Statistics:
 Stripe 1:
 Column 0: count: 1 hasNull: false
 Column 1: count: 1 hasNull: false bytesOnDisk: 15 min: 2018-12-24 18:30:00.0 max: 2018-12-24 18:30:00.0 min UTC: 2018-12-24 10:30:00.0 max UTC: 2018-12-24 10:30:00.0
{quote}
 

If I set the timezone of JVM to *Europe/Moscow*, I can reproduce your result:
{quote}$ java -Duser.timezone=Europe/Moscow -jar orc-tools-1.6.0-SNAPSHOT-uber.jar meta ~/Downloads/000000_0

Stripe Statistics:
 Stripe 1:
 Column 0: count: 1 hasNull: false
 Column 1: count: 1 hasNull: false bytesOnDisk: 15 min: 2018-12-24 18:30:00.0 max: 2018-12-24 18:30:00.0 min UTC: 2018-12-24 21:30:00.0 max UTC: 2018-12-24 21:30:00.0
{quote}
It seems that min/max UTC fields in the above result are related to JVM timezone setting. Therefore if I set the JVM timezone to *UTC*, the result is finally correct:
{quote}$ java -Duser.timezone=UTC -jar orc-tools-1.6.0-SNAPSHOT-uber.jar meta ~/Downloads/000000_0



Stripe Statistics:
 Stripe 1:
 Column 0: count: 1 hasNull: false
 Column 1: count: 1 hasNull: false bytesOnDisk: 15 min: 2018-12-24 18:30:00.0 max: 2018-12-24 18:30:00.0 min UTC: 2018-12-24 18:30:00.0 max UTC: 2018-12-24 18:30:00.0
{quote}
 

I think the root cause is here: [https://github.com/apache/orc/blob/rel/release-1.5.4/java/core/src/java/org/apache/orc/impl/ColumnStatisticsImpl.java#L1434]
{quote}@Override
 public Timestamp getMinimumUTC() {
    return minimum == null ? null : new Timestamp(minimum);
 }

@Override
 public Timestamp getMaximumUTC() {
    return maximum == null ? null : new Timestamp(maximum);
 }
{quote}
The code snippet actually interprets the timestamp based on JVM timezone. I am not sure how does Hive side handle this, it seems to me that ORC should interpret the timestamp in UTC timezone to recover its original timestamp string which is '2018-12-24 18:30:00.0' in the example above. [~owen.omalley] [~prasanth_j] [~jcamachorodriguez] Any thoughts?

> Timestamp statistics is wrong if read with useUTCTimestamp=true
> ---------------------------------------------------------------
>
>                 Key: ORC-451
>                 URL: https://issues.apache.org/jira/browse/ORC-451
>             Project: ORC
>          Issue Type: Bug
>    Affects Versions: 1.5.0
>         Environment: timezone for both client and server "Europe/Moscow" (UTC+3)
> hive version 3.1.0.3.0.1.0-187
>            Reporter: Rei Mai
>            Priority: Major
>         Attachments: 000000_0
>
>
> We're using external orc tables and a timezone "Europe/Moscow" (UTC+3) for both client and server. After switching to hive 3 which uses orc 1.5.x we've got an issue with predicate push down filtering out matching stripes by timestamp. E.g. consider a table (it's orc data is in the attachment):
> {quote}{{create external table test_ts (ts timestamp) stored as orc;}}
> {{insert into test_ts values ("2018-12-24 18:30:00");}}
> {{// No rows selected}}
> {{select * from test_ts where ts < "2018-12-24 19:00:00";}}
> // the lowest filter to return the value
> {{select * from test_ts where ts <= "2018-12-24 21:30:00";}}
> {quote}
> The issue only affects external orc tables statistics. Turning ppd off with _set hive.optimize.index.filter=false;_ helps.
> We believe it was the https://jira.apache.org/jira/browse/ORC-341, which introduced it.
> org.apache.orc.impl.SerializationUtils utc convertion is rather strange:
> {quote}public static long convertToUtc(TimeZone local, long time){
>  {color:#cc7832}  int {color}offset = local.getOffset(time - local.getRawOffset()){color:#cc7832};{color}{color:#cc7832}  return {color}time - offset{color:#cc7832};{color}
>  }
> {quote}
> This adds a 3 hour offset to our timestamp in UTC+3 timezone (shouldn't it substract 3 hours, btw?).
> If org.apache.orc.impl.TimestampStatisticsImpl is used with useUTCTimestamp=false, the timestamp is converted back in a compatible way via SerializationUtils.convertFromUtc. But hive seems to override default org.apache.orc.OrcFile.ReaderOptions with org.apache.hadoop.hive.ql.io.orc.ReaderOptions which have useUTCTimestamp(true) in it's constructor. With useUTCTimestamp=true evaluatePredicateProto predictate is using  TimestampStatisticsImpl.getMaximumUTC(), which returns the timestamp as is, i.e. in the example it's "2018-12-24 21:30:00 UTC+3".
> At the same time the predicate is not shifted (the value in this tez log is in UTC+3):
> {quote}2018-12-24 22:12:16,205 [INFO] [InputInitializer \{Map 1} #0|#0] |orc.OrcInputFormat|: ORC pushdown predicate: leaf-0 = (LESS_THAN ts 2018-12-24 19:00:00.0), expr = leaf-0
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)