Posted to user@impala.apache.org by Boris Tyukin <bo...@boristyukin.com> on 2018/12/19 19:31:57 UTC

timestamps and time zones

Hello,

I am trying to understand the reasons behind this decision by Impala devs.

From Impala docs:
http://impala.apache.org/docs/build/html/topics/impala_timestamp.html

By default, Impala stores and interprets TIMESTAMP values in UTC time zone
when writing to data files, reading from data files, or converting to and
from system time values through functions.

And there are two switches to change this behavior (a sketch of the first
one's effect follows the list):

use_local_tz_for_unix_timestamp_conversions
convert_legacy_hive_parquet_utc_timestamps (a performance killer that has
just been fixed in the latest Impala release, which has not made it to CDH
yet)
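
To make sure I understand the first switch, here is roughly what the
difference looks like to me, sketched in Python rather than Impala code
(the time zone below is just an example):

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo  # Python 3.9+

    unix_ts = 1546300800  # 2019-01-01 00:00:00 UTC

    # Default: from_unixtime() yields the UTC wall-clock time.
    utc_wall = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
    print(utc_wall.strftime("%Y-%m-%d %H:%M:%S"))  # 2019-01-01 00:00:00

    # With use_local_tz_for_unix_timestamp_conversions, the server's local
    # zone is used instead (hard-coded here only for illustration).
    local_wall = datetime.fromtimestamp(unix_ts, tz=ZoneInfo("America/New_York"))
    print(local_wall.strftime("%Y-%m-%d %H:%M:%S"))  # 2018-12-31 19:00:00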

My question is: what was the thought process, and what were the reasons,
for doing this conversion from UTC in the first place and having Impala
"assume" that a timestamp is always in UTC?

This is not how Hive or Spark or anything else I've seen before does it.
It is really unusual and causes tons of confusion if you try to use the
same data set from Hive, Spark and Impala, i.e. when Impala is not the only
engine on the cluster.

And second, why is there no option NOT to convert the time in the first
place and just use the value that was intended to be stored? If I stored
2015-01-01 12:12:00, whatever the time zone was, I still want to see that
exact time in Impala, Hive and Spark; I do not need Impala converting this
time to my local cluster time.

I am sure there is a reason for this; I am just struggling to understand
it...

Thanks,
Boris

Re: timestamps and time zones

Posted by Boris Tyukin <bo...@boristyukin.com>.
It is amazing how much complexity is behind something as simple as time :)
Thanks so much for the detailed write-up, Csaba. It really helps and makes
sense now. I do wish there was an option similar to
use_local_tz_for_unix_timestamp_conversions, but one that would not attempt
to convert the timestamp to the server time zone; granted, performance
would be worse because of the two-step conversion.

In our case, we have 100s of tables with thousands of timestamps. Data is
coming from 3 different time zones, and our requirement is to store these
values in the local time of those source systems (so users can see the
exact local time as it was in their local system). We used to keep time in
UTC, but it caused tons of issues and confusion with users (while we liked
the simplicity of having everything in UTC).

Based on what you said, it looks like we have two options:
1) upgrade to CDH 6.1 and use convert_legacy_hive_parquet_utc_timestamps
(which before was not usable for us due to the huge performance impact)
2) convert and store the time as we need, but using the string type rather
than timestamp. Luckily, Hive, Impala and Spark can accept string values as
arguments to date/time functions (a small sketch of this idea follows).
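
To illustrate option 2, a rough sketch in Python rather than SQL (the
format string is just what we would pick, not anything the engines
require):

    from datetime import datetime, timedelta

    # Store the wall-clock time exactly as it was in the source system,
    # as a plain string; no engine will reinterpret or shift it.
    stored = "2015-01-01 12:12:00"

    # The string can still be parsed whenever date arithmetic is needed.
    ts = datetime.strptime(stored, "%Y-%m-%d %H:%M:%S")
    print(ts + timedelta(days=30))  # 2015-01-31 12:12:00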

Boris



Re: timestamps and time zones

Posted by Csaba Ringhofer <cs...@cloudera.com>.
> My question is: what was the thought process, and what were the reasons,
for doing this conversion from UTC in the first place and having Impala
"assume" that a timestamp is always in UTC?
tl;dr: "timezone agnostic" is the fastest and simplest way to go,
especially if there is no good time zone library available.

I was not there when it was designed, but I think that the main reason
behind Impala's behavior is performance, and the secondary reason is that
finding and integrating a good timezone library is not trivial in C++. Note
that there are some nice time libraries, but most are not designed for
processing billions of records.

Impala uses a compact in-memory format for timestamps (a 32-bit count of
days since an epoch plus 64-bit nanoseconds since midnight). Storing
timestamps as strings or as a more complex structure (e.g. separate numbers
for year, month, day, ...) would need more memory and make some operations,
like comparison, slower.
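
Just to illustrate the idea, a Python sketch of packing such a 12-byte
value (I am using the Unix epoch here purely for illustration; the epoch
the real format counts days from may differ):

    import struct
    from datetime import date, datetime

    ts = datetime(2018, 12, 24, 20, 0, 0)

    # 32-bit day count (Unix epoch assumed here for simplicity).
    days = ts.date().toordinal() - date(1970, 1, 1).toordinal()
    # 64-bit nanoseconds since midnight.
    nanos = (ts.hour * 3600 + ts.minute * 60 + ts.second) * 10**9

    packed = struct.pack("<iq", days, nanos)  # 4 + 8 = 12 bytes
    print(len(packed))  # 12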

During a query like "SELECT timestamp_col FROM parquet_table", the
timestamps may need time zone conversion during two steps:
1. converting the timestamp from the way it is encoded in Parquet to
Impala's internal format
2. converting Impala's internal format to a string like 2018-12-24
20:00:00, as the result is returned as a string

By default Impala avoids time zone conversion in both steps by treating
timestamps as if they were in UTC:
1. Parquet uses the same binary format as Impala, so decoding can be as
simple as copying 12 bytes in memory
2. formatting as a string needs some tricky calculations due to leap years,
but does not need any time zone conversion (a sketch of the difference
follows)
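
To make the cost concrete, a Python sketch (not Impala's code; the zone is
an arbitrary example):

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo  # Python 3.9+

    # The stored value, treated as a UTC wall clock.
    stored = datetime(2018, 12, 24, 20, 0, 0)

    # Default step 2: format directly, no time zone math at all.
    print(stored.strftime("%Y-%m-%d %H:%M:%S"))  # 2018-12-24 20:00:00

    # A converting implementation would first shift every value to a local
    # zone, which costs a tz-database lookup per row.
    local = stored.replace(tzinfo=timezone.utc).astimezone(
        ZoneInfo("America/New_York"))
    print(local.strftime("%Y-%m-%d %H:%M:%S"))  # 2018-12-24 15:00:00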

convert_legacy_hive_parquet_utc_timestamps adds time zone conversion to
step 1 (if the writer is parquet-mr), which makes reading slower even with
the recent improvements. There is no option to do time zone conversion
during step 2.
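
For step 1 that flag effectively does something like the following per
value (again just a sketch, with the zone hard-coded for illustration; the
real code works on the decoded Parquet value):

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    # A value as written by parquet-mr, which normalized the original
    # local time to UTC on write.
    from_file = datetime(2018, 12, 24, 20, 0, 0)

    # The flag shifts it back to the server's local zone during decoding,
    # instead of the plain 12-byte copy.
    converted = from_file.replace(tzinfo=timezone.utc).astimezone(
        ZoneInfo("America/New_York")).replace(tzinfo=None)
    print(converted)  # 2018-12-24 15:00:00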

Hive and Spark assume the stored timestamp to be in UTC, but display it in
local time. I do not know how these systems represent timestamps in memory,
but I am sure that they need to do a time zone conversion in step 1 or 2.
This overhead may be less visible than in Impala, because most operations
are already slower, and the time library used is probably better
(especially compared to pre-3.1 Impala).

> (a performance killer that has just been fixed in the latest Impala
release, which has not made it to CDH yet)
Note that CDH 6.1 was released recently, and timezone database changes are
included in the release, see:
https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_610_new_features.html#impala_new_610
https://blog.cloudera.com/blog/2018/12/cloudera-enterprise-6-1-0-is-now-available/


Re: timestamps and time zones

Posted by Boris Tyukin <bo...@boristyukin.com>.
Wow, thanks for sharing this doc, a wealth of info there. Thanks Greg!


Re: timestamps and time zones

Posted by Greg Rahn <gr...@gmail.com>.
I think this document contains most of the challenges around timestamp
definition and management.  It's a long read but has the details
behind much of what you have mentioned.
https://docs.google.com/document/d/1gNRww9mZJcHvUDCXklzjFEQGpefsuR_akCDfWsdE35Q/edit?usp=sharing


