You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@impala.apache.org by Csaba Ringhofer <cs...@cloudera.com> on 2019/02/11 17:13:25 UTC

Re: Timestamps with less than nano precision: rounding vs truncating

Thanks for the feedback!

I will use rounding towards negative infinity in the implementation of
Parquet int64 timestamp writers.
+ I created a ticket about changing the rounding when writing Kudu
tables: IMPALA-8180



On Fri, Jan 25, 2019 at 7:54 PM Jim Apple <jb...@cloudera.com> wrote:

> Got it. After reading again, I think I understand the choices better. I
> like your idea to keep consistency with Hive and change the kudu writer
> timestamp rounding mode in Impala 4.0.
>
> On Thu, Jan 24, 2019 at 4:35 PM Csaba Ringhofer <cs...@cloudera.com>
> wrote:
>
> > Sorry, my  wording was bad  (I realised this after reading
> > en.wikipedia.org/wiki/Rounding): when I wrote "truncation towards
> negative
> > intinity" I meant "rounding towards minus infinity". I didn't know that
> > truncation means rounding towards 0.
> >
> > On Thu, Jan 24, 2019 at 8:53 PM Jim Apple <jb...@cloudera.com> wrote:
> >
> > > You use the phrase "truncation towards negative infinity". Can you
> give a
> > > concrete example where "truncation towards negative infinity" is
> > different
> > > from "truncation" and another where "truncation towards negative
> > infinity"
> > > is different from "rounding towards negative infinity"? As I mentioned
> > > above, I am not aware of a rounding mode entitled "truncation towards
> > > negative infinity".
> > >
> > > On Thu, Jan 24, 2019 at 11:41 AM Csaba Ringhofer <
> > csringhofer@cloudera.com
> > > >
> > > wrote:
> > >
> > > > Thanks for the comments!
> > > >
> > > > > Can you give a concrete example of an query where you are
> proposing a
> > > > change?
> > > > create table tkudu (id int primary key, t timestamp) stored as kudu;
> > > > insert into tkudu values
> > > >   (1,"1970-01-01 00:00:00.1111111"), -- all sub-second parts are 7
> > digit
> > > >   (2,"1970-01-01 23:59:59.9999999"),
> > > >   (3,"1969-12-31 23:59:59.9999999");
> > > > select * from tkudu;
> > > >
> > > > This currently returns:
> > > > 1,1970-01-01 00:00:00.111111000
> > > > 2,1970-01-02 00:00:00
> > > > 3,1970-01-01 00:00:00
> > > >
> > > > 1 was rounded down to microsec precision, while 2 and 3 were rounded
> up
> > > and
> > > > also stepped to another way.
> > > > With truncation towards negative infinity the query would return
> this:
> > > > 1,1970-01-01 00:00:00.111111000
> > > > 2,1970-01-01 23:59:59.999999000
> > > > 3,1969-12-31 23:59:59.999999000
> > > > So 1 would be the same, and 2 and 3 would be truncated from 7 to 6
> > digits
> > > > and would not step to a new day.
> > > >
> > > > My goal is not to change how Impala writes Kudu, but to get to a
> > > consensus
> > > > before going forward with writing Parquet milli/micro timestamps. (
> > > > https://gerrit.cloudera.org/#/c/12247/ )
> > > >
> > > > >That said, Oracle, Netezza, Vertica, and Postgres all round. Db2
> > > > truncates.
> > > > Thanks, good to know!
> > > > So it looks like that rounding is more popular, but consistency with
> > Hive
> > > > may be more important in Impala's case.
> > > >
> > > > On Thu, Jan 24, 2019 at 7:59 PM Greg Rahn <gr...@gmail.com>
> wrote:
> > > >
> > > > > For things like this the ISO SQL spec states  "the choice of
> whether
> > to
> > > > > round or truncate is implementation-defined".  That said, Oracle,
> > > > Netezza,
> > > > > Vertica, and Postgres all round. Db2 truncates.
> > > > >
> > > > > On Wed, Jan 23, 2019 at 12:26 PM Csaba Ringhofer <
> > > > csringhofer@cloudera.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Timestamps are often represented as ticks since some epoch, e.g.
> > > > > 1970.01.01
> > > > > > 00:00:00, so negative timestamps make sense as times before the
> > epoch
> > > > - I
> > > > > > meant rounding vs truncating towards 0 vs rounding towards
> negative
> > > > > > infinite in this sense. Truncating towards negative infinity
> means
> > > that
> > > > > > timestamps are always truncated to an earlier timestamp.
> Truncating
> > > > > towards
> > > > > > 0 would mean that before 1970, timestamps are truncated upwards,
> > > which
> > > > > can
> > > > > > lead to similar troubles as the I ones mentioned with rounding.
> On
> > > x86
> > > > > c++,
> > > > > > when a time_t is divided by an integer, the result is rounded
> > towards
> > > > 0,
> > > > > so
> > > > > > a naive implementation that uses time_t to represent timestamps
> can
> > > > > > truncate towards 0, but In impala::TimestampValue, time_ should
> be
> > > the
> > > > > > non-negative nanoseconds since midnight, so it can be simply
> > divided
> > > > with
> > > > > > 1000 to convert from nanoseconds to microseconds.
> > > > > >
> > > > >
> > > >
> > >
> >
>