You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@impala.apache.org by Franco Venturi <fv...@comcast.net> on 2017/12/16 23:38:30 UTC

Impala, Kudu, and timestamps (and UNIXTIME_MICROS...)


Please note that the discussion below refers to the following versions: 
- Impala: v2.10.0-cdh5.13.0 
- Kudu: 1.5.0-cdh5.13.0 
- Everything runs on a standard Cloudera 5.13 installation 




A few days ago I was writing some Java code to migrate several tables directly from Oracle to Kudu (to be queried later on by our developers and BI tools using Impala). Most of these tables have columns that are of type "timestamp" (to be exact, they come in as instances of class oracle.sql.TIMESTAMP and I cast them to java.sql.Timestamp; for the rest of this discussion I'll assume we only deal with objects of java.sql.Timestamp, to make things simple). 
As you probably know, Kudu, starting I think with version 1.3.1, has a type called 'UNIXTIME_MICROS') and that type gets mapped by Impala as "Impala TIMESTAMP" data type (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_timestamp.html). 




A good description of the meaning of 'UNIXTIME_MICROS' in Kudu is in the 'Apache Kudu Schema Design' document (https://kudu.apache.org/docs/schema_design.html), which says: 




unixtime_micros (64-bit microseconds since the Unix epoch) 




where the 'Unix epoch' is defined as 1/1/1970 00:00:00 GMT. 




With this understanding I went ahead and wrote my Java code; when I ran the first few tests, I noticed that the timestamp values returned by Impala (I created in Impala an 'external' table 'stored as kudu') were off by several hours compared to the values returned by the original table in Oracle (our servers, both the Oracle ones and the Impala/Kudu ones, are all configured in the 'America/New_York' timezone). 




To investigate this difference, I created a simple table in Kudu with just two columns, an INT64 as the primary key and a UNIXTIME_MICROS as a timestamp. I ran a few inserts and selects over this table in Impala and figured out that Impala stores a value that is more or less defined as follow: 




number of microseconds since the Unix epoch (i.e. what I was expecting originally) 
+ offset of the timestamp I inserted with respect to GMT (in my case this offset is the offset for EST or EDT depending if that timestamp was during EST (winter) or EDT (summer)) 


This is how Impala achieves what is described as: 




Impala does not store timestamps using the local timezone, to avoid undesired results from unexpected time zone issues 




That same page has caveats like the following, that sent a shiver down my spine: 




If that value was written to a data file, and shipped off to a distant server to be analyzed alongside other data from far-flung locations, the dates and times would not match up precisely because of time zone differences 


This means that if anyone is using (or even thinking about using) "Impala timestamps" to say store financial or health services (or security) events, they'll find some nasty "surprises" (even if they don't plan to ever move their servers and only do business in one timezone). 




Consider for instance the case of anything that occurred between 1am and 2am EDT on 11/5/2017 (i.e. in the hour before we moved our clocks back from EDT to EST) - there's no way to store the timestamps for these events in Kudu via Impala. 

To prove this I wrote this simple piece of Java code (which uses Java 8 and all well documented and non-deprecated classes and methods) to do just an insert and a select via Impala JDBC of a timestamp row in the simple table that I mentioned above (primary key + timestamp column): 





// run insert 
long primaryKey = 1L; 
PreparedStatement insert = connection.prepareStatement("insert into " + table + " values (?, ?)"); 
insert.setLong(1, primaryKey); 
Timestamp timestampIn = new Timestamp(1509859800000L); 
System.out.println("TimestampIn: " + timestampIn + " - getTime(): " + timestampIn.getTime()); 
insert.setTimestamp(2, timestampIn); 
insert.executeUpdate(); 
insert.close(); 




// run select 
PreparedStatement select = connection.prepareStatement("select " + timestampColumn + " from " + table + " where " + primaryKeyColumn + "=?"); 
select.setLong(1, primaryKey); 
ResultSet resultSet = select.executeQuery(); 
while (resultSet.next()) { 
Timestamp timestampOut = resultSet.getTimestamp(1); 
System.out.println("TimestampOut: " + timestampOut + " - getTime(): " + timestampOut.getTime()); 
} 
resultSet.close(); 
select.close(); 





and this is the output: 




TimestampIn: 2017-11-05 01:30:00.0 - getTime(): 1509859800000 
TimestampOut: 2017-11-05 01:30:00.0 - getTime(): 1509863400000 





If this kind of timestamp had been designed say in 1972, I would have probably been OK with it, but in 2017, when people and companies do business 24/7, I find this approach (and sentences like "dates and times would not match up precisely") unacceptable. 





To its defense, the Impala TIMESTAMP web page mentioned above spends several paragraphs about how to deal with "UTC timestamps" and problems like the ones shown above, but it basically has the user wrap everything in Impala functions like 'to_utc_timestamp()' and 'from_utc_timestamp()'. 





This proposed solution - besides the performance impact of say having a query like this: 




select ... where to_utc_timestamp(timestamp_column_a) < to_utc_timestamp(timestamp_column_b) 




makes also much harder to write queries having timestamp fields (which is very common in my workplace) or to use it with automated BI tools. 







It is definitely not what one would expect from an analytic MPP database, that aspires to be used at the Enterprise level in 2017. 





Regards, 
Franco Venturi

Re: Impala, Kudu, and timestamps (and UNIXTIME_MICROS...)

Posted by Greg Rahn <gr...@gmail.com>.

Makes sense.
If the application were to operate in UTC times (which is generally a good
practice), it would work as desired since there is no DST with UTC so the
>= would eval correctly.

Out of curiosity, when you say "TIMESTAMP WITH TIME ZONE", is this in
reference to the type of this name from Oracle, or perhaps some other
RDBMS?  Just curious since there are various names and behaviors out there.

On Sat, Jan 13, 2018 at 9:05 AM, Franco Venturi <fv...@comcast.net>
wrote:

> Greg, thanks for checking.
>
>
> Since we needed a short term solution for a project that is supposed to go
> live shortly and we understand that this would be a major change/addition
> in the Impala data types and consequently in its code base, we found a
> workable solution for now and we are going to "live with it".
>
> While you were out we also discussed it with Peter Ebert and I created a
> new 'Feature Request' with our vendor for the 'TIMESTAMP WITH TIMEZONE'
> data type in Impala.
>
> For your convenience this is what I put in the support case I created:
>
>
>     There are several use cases where the current 'TIMESTAMP' (without
> timezone) data type in Impala falls short.
>
>     One such case would be in a Kudu replication scenario where a
> secondary (replica) instance of the Kudu data would have to be periodically
> (say every 15 minutes) updated with the data from say a primary instance.
> In this case Todd Lipcon suggested us to have a "timestamp" column in Kudu
> called say 'updated_on' and periodically run something like this:
>
>         UPSERT INTO backup_cluster_table SELECT * FROM original_table
> WHERE updated_on >= $last_update_time
>
>     Unfortunately with Impala's current implementation of the 'TIMESTAMP'
> data type, a query like the one above would not work during the Daylight
> Saving Time change in the fall when we "move the clock" back an hour (I can
> send more details about this situation, if needed).
>
>     Another important use case for us (and probable not just us) is for
> auditing purposes (regulatory compliance, etc) where we need to know when
> (in absolute time) an event occurred – a very good example for this use
> case would be with customers trying to use Impala/Kudu for storing logs of
> events, or for security applications, where the information about the
> absolute time an event occurred is crucial.
>
>
> Franco
>
>
> On January 12, 2018 at 12:15 AM Greg Rahn <gr...@gmail.com> wrote:
>
> Franco-
>
> Are there still outstanding questions or issues with this?  I'm just
> returning from holiday so wanted to check in.
>
> Agreed that timestamp w/ and w/o time zone are different beasts and come
> with their own challenges.
>
> Cheers,
>
>
> On Tue, Dec 19, 2017 at 6:23 AM, Franco Venturi <fv...@comcast.net>
> wrote:
>
> Thanks for your input Belazs.
> I am still pondering about this issue and I found very useful to follow
> your advice on seeing what's already in JIRA and doing some research on
> what others are doing.
>
> I didn't find anything that exactly describes my problem on the Impala
> JIRA (I might have missed it), however I did find a closely related
> discussion that is occurring in the Hive JIRA: I am referring to JIRAs
> HIVE-14305 (https://issues.apache.org/jira/browse/HIVE-14305), HIVE-14412
> (https://issues.apache.org/jira/browse/HIVE-14412), and HIVE-16418 (
> https://issues.apache.org/jira/browse/HIVE-16418).
>
> I found also useful to read how two of the most used RDBMS are dealing
> with timestamps:
>    - Oracle (see: https://docs.oracle.com/en/dat
> abase/oracle/oracle-database/12.2/nlspg/datetime-data-
> types-and-time-zone-support.html)
>    - PostgreSQL (see: https://www.postgresql.org/doc
> s/10/static/datatype-datetime.html)
>
> Finally I am going to refer to the Java 8 API as described here:
> https://docs.oracle.com/javase/8/docs/api/overview-summary.html; the
> reason is that a lot of Impala users (me included) connect to it via
> Java/JDBC and it is very important in my opinion that the Impala data types
> are correctly mapped to the correct Java classes to avoid confusion.
>
>
> After this long preamble here are some notes that hopefully will help in
> this discussion.
>
> - At the center of this issue is that there are two significantly
> different data types that we commonly refer to when we use the word
> 'timestamp':
>       - timestamp without time zone (most commonly referred to as just
> 'timestamp')
>       - timestamp with time zone
>
> - These two types are like apples and oranges in my opinion; getting
> confused between the two (like I did) can cause a lot of frustration (my
> first post in this thread shows that pretty well..)
>
> - The first data type ('timestamp without time zone' or just 'timestamp')
> is best described in this sentence in section 4.2.1.6 (Choosing a TIMESTAMP
> Data Type) in the Oracle document:
>
>         "Use the TIMESTAMP data type when you need a datetime value to
> record the time of an event without the time zone. For example, you can
> store information about the times when workers punch a time card in and out
> of their assembly line workstations. Because this is always a local time it
> is then not needed to store the timezone part"
>
> - I think this is the kind of timestamp that is currently implemented in
> Impala (please correct me if I am wrong) and in my opinion it should be
> mapped to something like the Java type java.time.LocalDateTime; the Java 8
> API for java.time.LocalDateTime has this sentence that I think provides a
> good insight on the meaning of this data type:
>
>         "This class does not store or represent a time-zone. Instead, it
> is a description of the date, as used for birthdays, combined with the
> local time as seen on a wall clock. It cannot represent an instant on the
> time-line without additional information such as an offset or time-zone."
>
> - Also this timestamp (again in my opinion) should not be mapped to the
> Kudu type UNIXTIME_MICROS, because their meaning is different (and this is
> what triggered my initial confusion, and I suspect I am not the only one
> out there, who misunderstood this difference)
>
>
> - For the second data type ('timestamp with time zone') I couldn't find a
> good definition in the Oracle document, however this sentence from the Java
> API for java.util.Date (of which java.sql.Timestamp is a subclass) offers a
> good insight:
>
>         "The class Date represents a specific instant in time"
>
> - This is also indirectly implied by the following sentence in the Oracle
> document:
>
>         "Two TIMESTAMP WITH TIME ZONE values are considered identical if
> they represent the same instant in UTC, regardless of the TIME ZONE offsets
> stored in the data."
>
> - The 'timestamp with time zone' is what I think should be mapped to the
> Java type 'java.sql.Timezone' and what also corresponds to the actual
> meaning of the Kudu type UNIXTIME_MICROS
>
> - This is the kind of timestamp type that is most useful to businesses
> (financial, healthcase, security logs) because it captures the "absolute"
> moment in time a transaction or an event occurred, regardless of which
> timezone it occurred, or if it was daylight saving time or not
>
>
> In conclusion the point I am at right now is that to really fix the issue
> in Impala, two things should occur:
>
> - a new data type ('timestamp with time zone'), with the features
> mentioned above, should be implemented - this change would hopefully not
> break any of the existing applications
>
> - the current 'timestamp' type should be mapped to the Java class
> java.time.LocalDateTime (not java.sql.Timestamp) - this would definitely
> break existing applications, and therefore should be postponed to Impala 3.0
>
>
> Regards,
> Franco
>
>
> ------------------------------
> *From: *"Jeszy" <je...@gmail.com>
> *To: *user@impala.apache.org
> *Sent: *Monday, December 18, 2017 7:49:52 AM
> *Subject: *Re: Impala, Kudu, and timestamps (and UNIXTIME_MICROS...)
>
> Hello Franco,
>
> Thanks for your feedback! I agree there are pain points with using
> timestamps, especially together with other systems.
> Is there any particular approach or solution you propose that would
> work well for you? Have you found any jiras on issues.apache.org that
> describe what you're asking for? Commenting on a jira will help the
> team track your input better.
>
> Regards,
> Balazs
>
> On 17 December 2017 at 00:38, Franco Venturi <fv...@comcast.net> wrote:
> > Please note that the discussion below refers to the following versions:
> >       - Impala: v2.10.0-cdh5.13.0
> >       - Kudu: 1.5.0-cdh5.13.0
> >       - Everything runs on a standard Cloudera 5.13 installation
> >
> >
> > A few days ago I was writing some Java code to migrate several tables
> > directly from Oracle to Kudu (to be queried later on by our developers
> and
> > BI tools using Impala). Most of these tables have columns that are of
> type
> > "timestamp" (to be exact, they come in as instances of class
> > oracle.sql.TIMESTAMP and I cast them to java.sql.Timestamp; for the rest
> of
> > this discussion I'll assume we only deal with objects of
> java.sql.Timestamp,
> > to make things simple).
> > As you probably know, Kudu, starting I think with version 1.3.1, has a
> type
> > called 'UNIXTIME_MICROS') and that type gets mapped by Impala as "Impala
> > TIMESTAMP" data type
> > (https://www.cloudera.com/documentation/enterprise/latest/
> topics/impala_timestamp.html).
> >
> >
> > A good description of the meaning of 'UNIXTIME_MICROS' in Kudu is in the
> > 'Apache Kudu Schema Design' document
> > (https://kudu.apache.org/docs/schema_design.html), which says:
> >
> >
> >       unixtime_micros (64-bit microseconds since the Unix epoch)
> >
> >
> > where the 'Unix epoch' is defined as 1/1/1970 00:00:00 GMT.
> >
> >
> > With this understanding I went ahead and wrote my Java code; when I ran
> the
> > first few tests, I noticed that the timestamp values returned by Impala
> (I
> > created in Impala an 'external' table 'stored as kudu') were off by
> several
> > hours compared to the values returned by the original table in Oracle
> (our
> > servers, both the Oracle ones and the Impala/Kudu ones, are all
> configured
> > in the 'America/New_York' timezone).
> >
> >
> > To investigate this difference, I created a simple table in Kudu with
> just
> > two columns, an INT64 as the primary key and a UNIXTIME_MICROS as a
> > timestamp. I ran a few inserts and selects over this table in Impala and
> > figured out that Impala stores a value that is more or less defined as
> > follow:
> >
> >
> >       number of microseconds since the Unix epoch (i.e. what I was
> expecting
> > originally)
> >       + offset of the timestamp I inserted with respect to GMT (in my
> case
> > this offset is the offset for EST or EDT depending if that timestamp was
> > during EST (winter) or EDT (summer))
> >
> >
> > This is how Impala achieves what is described as:
> >
> >
> >       Impala does not store timestamps using the local timezone, to avoid
> > undesired results from unexpected time zone issues
> >
> >
> > That same page has caveats like the following, that sent a shiver down my
> > spine:
> >
> >
> >       If that value was written to a data file, and shipped off to a
> distant
> > server to be analyzed alongside other data from far-flung locations, the
> > dates and times would not match up precisely because of time zone
> > differences
> >
> >
> > This means that if anyone is using (or even thinking about using) "Impala
> > timestamps" to say store financial or health services (or security)
> events,
> > they'll find some nasty "surprises" (even if they don't plan to ever move
> > their servers and only do business in one timezone).
> >
> >
> > Consider for instance the case of anything that occurred between 1am and
> 2am
> > EDT on 11/5/2017 (i.e. in the hour before we moved our clocks back from
> EDT
> > to EST) - there's no way to store the timestamps for these events in Kudu
> > via Impala.
> >
> > To prove this I wrote this simple piece of Java code (which uses Java 8
> and
> > all well documented and non-deprecated classes and methods) to do just an
> > insert and a select via Impala JDBC of a timestamp row in the simple
> table
> > that I mentioned above (primary key + timestamp column):
> >
> >
> >
> >       // run insert
> >       long primaryKey = 1L;
> >       PreparedStatement insert = connection.prepareStatement("insert
> into "
> > + table + " values (?, ?)");
> >       insert.setLong(1, primaryKey);
> >       Timestamp timestampIn = new Timestamp(1509859800000L);
> >       System.out.println("TimestampIn: " + timestampIn + " - getTime():
> " +
> > timestampIn.getTime());
> >       insert.setTimestamp(2, timestampIn);
> >       insert.executeUpdate();
> >       insert.close();
> >
> >
> >       // run select
> >       PreparedStatement select = connection.prepareStatement("select " +
> > timestampColumn + " from " + table + " where " + primaryKeyColumn +
> "=?");
> >       select.setLong(1, primaryKey);
> >       ResultSet resultSet = select.executeQuery();
> >       while (resultSet.next()) {
> >           Timestamp timestampOut = resultSet.getTimestamp(1);
> >           System.out.println("TimestampOut: " + timestampOut + " -
> > getTime(): " + timestampOut.getTime());
> >       }
> >       resultSet.close();
> >       select.close();
> >
> >
> >
> > and this is the output:
> >
> >
> >      TimestampIn: 2017-11-05 01:30:00.0 - getTime(): 1509859800000
> >      TimestampOut: 2017-11-05 01:30:00.0 - getTime(): 1509863400000
> >
> >
> >
> > If this kind of timestamp had been designed say in 1972, I would have
> > probably been OK with it, but in 2017, when people and companies do
> business
> > 24/7, I find this approach (and sentences like "dates and times would not
> > match up precisely") unacceptable.
> >
> >
> >
> > To its defense, the Impala TIMESTAMP web page mentioned above spends
> several
> > paragraphs about how to deal with "UTC timestamps" and problems like the
> > ones shown above, but it basically has the user wrap everything in Impala
> > functions like 'to_utc_timestamp()' and 'from_utc_timestamp()'.
> >
> >
> > This proposed solution - besides the performance impact of say having a
> > query like this:
> >
> >
> >       select ... where to_utc_timestamp(timestamp_column_a) <
> > to_utc_timestamp(timestamp_column_b)
> >
> >
> > makes also much harder to write queries having timestamp fields (which is
> > very common in my workplace) or to use it with automated BI tools.
> >
> >
> >
> > It is definitely not what one would expect from an analytic MPP database,
> > that aspires to be used at the Enterprise level in 2017.
> >
> >
> >
> > Regards,
> > Franco Venturi
>
>
>

Re: Impala, Kudu, and timestamps (and UNIXTIME_MICROS...)

Posted by Franco Venturi <fv...@comcast.net>.

Greg, thanks for checking.


Since we needed a short term solution for a project that is supposed to go live shortly and we understand that this would be a major change/addition in the Impala data types and consequently in its code base, we found a workable solution for now and we are going to "live with it".

While you were out we also discussed it with Peter Ebert and I created a new 'Feature Request' with our vendor for the 'TIMESTAMP WITH TIMEZONE' data type in Impala.

For your convenience this is what I put in the support case I created:


    There are several use cases where the current 'TIMESTAMP' (without timezone) data type in Impala falls short.

    One such case would be in a Kudu replication scenario where a secondary (replica) instance of the Kudu data would have to be periodically (say every 15 minutes) updated with the data from say a primary instance. In this case Todd Lipcon suggested us to have a "timestamp" column in Kudu called say 'updated_on' and periodically run something like this:

        UPSERT INTO backup_cluster_table SELECT * FROM original_table WHERE updated_on >= $last_update_time

    Unfortunately with Impala's current implementation of the 'TIMESTAMP' data type, a query like the one above would not work during the Daylight Saving Time change in the fall when we "move the clock" back an hour (I can send more details about this situation, if needed).

    Another important use case for us (and probable not just us) is for auditing purposes (regulatory compliance, etc) where we need to know when (in absolute time) an event occurred – a very good example for this use case would be with customers trying to use Impala/Kudu for storing logs of events, or for security applications, where the information about the absolute time an event occurred is crucial.


Franco


> On January 12, 2018 at 12:15 AM Greg Rahn <gr...@gmail.com> wrote:
> 
>     Franco-
> 
>     Are there still outstanding questions or issues with this?  I'm just returning from holiday so wanted to check in.  
> 
>     Agreed that timestamp w/ and w/o time zone are different beasts and come with their own challenges. 
> 
>     Cheers,
> 
> 
>     On Tue, Dec 19, 2017 at 6:23 AM, Franco Venturi <fventuri@comcast.net mailto:fventuri@comcast.net > wrote:
> 
>         > >         Thanks for your input Belazs.
> >         I am still pondering about this issue and I found very useful to follow your advice on seeing what's already in JIRA and doing some research on what others are doing.
> > 
> >         I didn't find anything that exactly describes my problem on the Impala JIRA (I might have missed it), however I did find a closely related discussion that is occurring in the Hive JIRA: I am referring to JIRAs HIVE-14305 (https://issues.apache.org/jira/browse/HIVE-14305 https://issues.apache.org/jira/browse/HIVE-14305 ), HIVE-14412 (https://issues.apache.org/jira/browse/HIVE-14412 https://issues.apache.org/jira/browse/HIVE-14412 ), and HIVE-16418 (https://issues.apache.org/jira/browse/HIVE-16418 https://issues.apache.org/jira/browse/HIVE-16418 ).
> > 
> >         I found also useful to read how two of the most used RDBMS are dealing with timestamps:
> >            - Oracle (see: https://docs.oracle.com/en/database/oracle/oracle-database/12.2/nlspg/datetime-data-types-and-time-zone-support.html https://docs.oracle.com/en/database/oracle/oracle-database/12.2/nlspg/datetime-data-types-and-time-zone-support.html )
> >            - PostgreSQL (see: https://www.postgresql.org/docs/10/static/datatype-datetime.html https://www.postgresql.org/docs/10/static/datatype-datetime.html )
> > 
> >         Finally I am going to refer to the Java 8 API as described here: https://docs.oracle.com/javase/8/docs/api/overview-summary.html https://docs.oracle.com/javase/8/docs/api/overview-summary.html ; the reason is that a lot of Impala users (me included) connect to it via Java/JDBC and it is very important in my opinion that the Impala data types are correctly mapped to the correct Java classes to avoid confusion.
> > 
> > 
> >         After this long preamble here are some notes that hopefully will help in this discussion.
> > 
> >         - At the center of this issue is that there are two significantly different data types that we commonly refer to when we use the word 'timestamp':
> >               - timestamp without time zone (most commonly referred to as just 'timestamp')
> >               - timestamp with time zone
> > 
> >         - These two types are like apples and oranges in my opinion; getting confused between the two (like I did) can cause a lot of frustration (my first post in this thread shows that pretty well..)
> > 
> >         - The first data type ('timestamp without time zone' or just 'timestamp') is best described in this sentence in section 4.2.1.6 (Choosing a TIMESTAMP Data Type) in the Oracle document:
> > 
> >                 "Use the TIMESTAMP data type when you need a datetime value to record the time of an event without the time zone. For example, you can store information about the times when workers punch a time card in and out of their assembly line workstations. Because this is always a local time it is then not needed to store the timezone part"
> > 
> >         - I think this is the kind of timestamp that is currently implemented in Impala (please correct me if I am wrong) and in my opinion it should be mapped to something like the Java type java.time.LocalDateTime; the Java 8 API for java.time.LocalDateTime has this sentence that I think provides a good insight on the meaning of this data type:
> > 
> >                 "This class does not store or represent a time-zone. Instead, it is a description of the date, as used for birthdays, combined with the local time as seen on a wall clock. It cannot represent an instant on the time-line without additional information such as an offset or time-zone."
> > 
> >         - Also this timestamp (again in my opinion) should not be mapped to the Kudu type UNIXTIME_MICROS, because their meaning is different (and this is what triggered my initial confusion, and I suspect I am not the only one out there, who misunderstood this difference)
> > 
> > 
> >         - For the second data type ('timestamp with time zone') I couldn't find a good definition in the Oracle document, however this sentence from the Java API for java.util.Date (of which java.sql.Timestamp is a subclass) offers a good insight:
> > 
> >                 "The class Date represents a specific instant in time"
> > 
> >         - This is also indirectly implied by the following sentence in the Oracle document:
> > 
> >                 "Two TIMESTAMP WITH TIME ZONE values are considered identical if they represent the same instant in UTC, regardless of the TIME ZONE offsets stored in the data."
> > 
> >         - The 'timestamp with time zone' is what I think should be mapped to the Java type 'java.sql.Timezone' and what also corresponds to the actual meaning of the Kudu type UNIXTIME_MICROS
> > 
> >         - This is the kind of timestamp type that is most useful to businesses (financial, healthcase, security logs) because it captures the "absolute" moment in time a transaction or an event occurred, regardless of which timezone it occurred, or if it was daylight saving time or not
> > 
> > 
> >         In conclusion the point I am at right now is that to really fix the issue in Impala, two things should occur:
> > 
> >         - a new data type ('timestamp with time zone'), with the features mentioned above, should be implemented - this change would hopefully not break any of the existing applications
> > 
> >         - the current 'timestamp' type should be mapped to the Java class java.time.LocalDateTime (not java.sql.Timestamp) - this would definitely break existing applications, and therefore should be postponed to Impala 3.0
> > 
> > 
> >         Regards,
> >         Franco
> > 
> > 
> > 
> >         ---------------------------------------------
> >         From: "Jeszy" <jeszyb@gmail.com mailto:jeszyb@gmail.com >
> >         To: user@impala.apache.org mailto:user@impala.apache.org
> >         Sent: Monday, December 18, 2017 7:49:52 AM
> >         Subject: Re: Impala, Kudu, and timestamps (and UNIXTIME_MICROS...)
> > 
> >         Hello Franco,
> > 
> >         Thanks for your feedback! I agree there are pain points with using
> >         timestamps, especially together with other systems.
> >         Is there any particular approach or solution you propose that would
> >         work well for you? Have you found any jiras onhttp://issues.apache.org that
> >         describe what you're asking for? Commenting on a jira will help the
> >         team track your input better.
> > 
> >         Regards,
> >         Balazs
> > 
> >         On 17 December 2017 at 00:38, Franco Venturi <fventuri@comcast.net mailto:fventuri@comcast.net > wrote:
> >         > Please note that the discussion below refers to the following versions:
> >         >       - Impala: v2.10.0-cdh5.13.0
> >         >       - Kudu: 1.5.0-cdh5.13.0
> >         >       - Everything runs on a standard Cloudera 5.13 installation
> >         >
> >         >
> >         > A few days ago I was writing some Java code to migrate several tables
> >         > directly from Oracle to Kudu (to be queried later on by our developers and
> >         > BI tools using Impala). Most of these tables have columns that are of type
> >         > "timestamp" (to be exact, they come in as instances of class
> >         > oracle.sql.TIMESTAMP and I cast them to java.sql.Timestamp; for the rest of
> >         > this discussion I'll assume we only deal with objects of java.sql.Timestamp,
> >         > to make things simple).
> >         > As you probably know, Kudu, starting I think with version 1.3.1, has a type
> >         > called 'UNIXTIME_MICROS') and that type gets mapped by Impala as "Impala
> >         > TIMESTAMP" data type
> >         > (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_timestamp.html https://www.cloudera.com/documentation/enterprise/latest/topics/impala_timestamp.html ).
> >         >
> >         >
> >         > A good description of the meaning of 'UNIXTIME_MICROS' in Kudu is in the
> >         > 'Apache Kudu Schema Design' document
> >         > (https://kudu.apache.org/docs/schema_design.html https://kudu.apache.org/docs/schema_design.html ), which says:
> >         >
> >         >
> >         >       unixtime_micros (64-bit microseconds since the Unix epoch)
> >         >
> >         >
> >         > where the 'Unix epoch' is defined as 1/1/1970 00:00:00 GMT.
> >         >
> >         >
> >         > With this understanding I went ahead and wrote my Java code; when I ran the
> >         > first few tests, I noticed that the timestamp values returned by Impala (I
> >         > created in Impala an 'external' table 'stored as kudu') were off by several
> >         > hours compared to the values returned by the original table in Oracle (our
> >         > servers, both the Oracle ones and the Impala/Kudu ones, are all configured
> >         > in the 'America/New_York' timezone).
> >         >
> >         >
> >         > To investigate this difference, I created a simple table in Kudu with just
> >         > two columns, an INT64 as the primary key and a UNIXTIME_MICROS as a
> >         > timestamp. I ran a few inserts and selects over this table in Impala and
> >         > figured out that Impala stores a value that is more or less defined as
> >         > follow:
> >         >
> >         >
> >         >       number of microseconds since the Unix epoch (i.e. what I was expecting
> >         > originally)
> >         >       + offset of the timestamp I inserted with respect to GMT (in my case
> >         > this offset is the offset for EST or EDT depending if that timestamp was
> >         > during EST (winter) or EDT (summer))
> >         >
> >         >
> >         > This is how Impala achieves what is described as:
> >         >
> >         >
> >         >       Impala does not store timestamps using the local timezone, to avoid
> >         > undesired results from unexpected time zone issues
> >         >
> >         >
> >         > That same page has caveats like the following, that sent a shiver down my
> >         > spine:
> >         >
> >         >
> >         >       If that value was written to a data file, and shipped off to a distant
> >         > server to be analyzed alongside other data from far-flung locations, the
> >         > dates and times would not match up precisely because of time zone
> >         > differences
> >         >
> >         >
> >         > This means that if anyone is using (or even thinking about using) "Impala
> >         > timestamps" to say store financial or health services (or security) events,
> >         > they'll find some nasty "surprises" (even if they don't plan to ever move
> >         > their servers and only do business in one timezone).
> >         >
> >         >
> >         > Consider for instance the case of anything that occurred between 1am and 2am
> >         > EDT on 11/5/2017 (i.e. in the hour before we moved our clocks back from EDT
> >         > to EST) - there's no way to store the timestamps for these events in Kudu
> >         > via Impala.
> >         >
> >         > To prove this I wrote this simple piece of Java code (which uses Java 8 and
> >         > all well documented and non-deprecated classes and methods) to do just an
> >         > insert and a select via Impala JDBC of a timestamp row in the simple table
> >         > that I mentioned above (primary key + timestamp column):
> >         >
> >         >
> >         >
> >         >       // run insert
> >         >       long primaryKey = 1L;
> >         >       PreparedStatement insert = connection.prepareStatement("insert into "
> >         > + table + " values (?, ?)");
> >         >       insert.setLong(1, primaryKey);
> >         >       Timestamp timestampIn = new Timestamp(1509859800000L);
> >         >       System.out.println("TimestampIn: " + timestampIn + " - getTime(): " +
> >         > timestampIn.getTime());
> >         >       insert.setTimestamp(2, timestampIn);
> >         >       insert.executeUpdate();
> >         >       insert.close();
> >         >
> >         >
> >         >       // run select
> >         >       PreparedStatement select = connection.prepareStatement("select " +
> >         > timestampColumn + " from " + table + " where " + primaryKeyColumn + "=?");
> >         >       select.setLong(1, primaryKey);
> >         >       ResultSet resultSet = select.executeQuery();
> >         >       while (resultSet.next()) {
> >         >           Timestamp timestampOut = resultSet.getTimestamp(1);
> >         >           System.out.println("TimestampOut: " + timestampOut + " -
> >         > getTime(): " + timestampOut.getTime());
> >         >       }
> >         >       resultSet.close();
> >         >       select.close();
> >         >
> >         >
> >         >
> >         > and this is the output:
> >         >
> >         >
> >         >      TimestampIn: 2017-11-05 01:30:00.0 - getTime(): 1509859800000
> >         >      TimestampOut: 2017-11-05 01:30:00.0 - getTime(): 1509863400000
> >         >
> >         >
> >         >
> >         > If this kind of timestamp had been designed say in 1972, I would have
> >         > probably been OK with it, but in 2017, when people and companies do business
> >         > 24/7, I find this approach (and sentences like "dates and times would not
> >         > match up precisely") unacceptable.
> >         >
> >         >
> >         >
> >         > To its defense, the Impala TIMESTAMP web page mentioned above spends several
> >         > paragraphs about how to deal with "UTC timestamps" and problems like the
> >         > ones shown above, but it basically has the user wrap everything in Impala
> >         > functions like 'to_utc_timestamp()' and 'from_utc_timestamp()'.
> >         >
> >         >
> >         > This proposed solution - besides the performance impact of say having a
> >         > query like this:
> >         >
> >         >
> >         >       select ... where to_utc_timestamp(timestamp_column_a) <
> >         > to_utc_timestamp(timestamp_column_b)
> >         >
> >         >
> >         > makes also much harder to write queries having timestamp fields (which is
> >         > very common in my workplace) or to use it with automated BI tools.
> >         >
> >         >
> >         >
> >         > It is definitely not what one would expect from an analytic MPP database,
> >         > that aspires to be used at the Enterprise level in 2017.
> >         >
> >         >
> >         >
> >         > Regards,
> >         > Franco Venturi
> > 
> > 
> >     > 
>

Re: Impala, Kudu, and timestamps (and UNIXTIME_MICROS...)

Posted by Greg Rahn <gr...@gmail.com>.

Franco-

Are there still outstanding questions or issues with this?  I'm just
returning from holiday so wanted to check in.

Agreed that timestamp w/ and w/o time zone are different beasts and come
with their own challenges.

Cheers,


On Tue, Dec 19, 2017 at 6:23 AM, Franco Venturi <fv...@comcast.net>
wrote:

> Thanks for your input Belazs.
> I am still pondering about this issue and I found very useful to follow
> your advice on seeing what's already in JIRA and doing some research on
> what others are doing.
>
> I didn't find anything that exactly describes my problem on the Impala
> JIRA (I might have missed it), however I did find a closely related
> discussion that is occurring in the Hive JIRA: I am referring to JIRAs
> HIVE-14305 (https://issues.apache.org/jira/browse/HIVE-14305), HIVE-14412
> (https://issues.apache.org/jira/browse/HIVE-14412), and HIVE-16418 (
> https://issues.apache.org/jira/browse/HIVE-16418).
>
> I found also useful to read how two of the most used RDBMS are dealing
> with timestamps:
>    - Oracle (see: https://docs.oracle.com/en/database/oracle/oracle-
> database/12.2/nlspg/datetime-data-types-and-time-zone-support.html)
>    - PostgreSQL (see: https://www.postgresql.org/docs/10/static/datatype-
> datetime.html)
>
> Finally I am going to refer to the Java 8 API as described here:
> https://docs.oracle.com/javase/8/docs/api/overview-summary.html; the
> reason is that a lot of Impala users (me included) connect to it via
> Java/JDBC and it is very important in my opinion that the Impala data types
> are correctly mapped to the correct Java classes to avoid confusion.
>
>
> After this long preamble here are some notes that hopefully will help in
> this discussion.
>
> - At the center of this issue is that there are two significantly
> different data types that we commonly refer to when we use the word
> 'timestamp':
>       - timestamp without time zone (most commonly referred to as just
> 'timestamp')
>       - timestamp with time zone
>
> - These two types are like apples and oranges in my opinion; getting
> confused between the two (like I did) can cause a lot of frustration (my
> first post in this thread shows that pretty well..)
>
> - The first data type ('timestamp without time zone' or just 'timestamp')
> is best described in this sentence in section 4.2.1.6 (Choosing a TIMESTAMP
> Data Type) in the Oracle document:
>
>         "Use the TIMESTAMP data type when you need a datetime value to
> record the time of an event without the time zone. For example, you can
> store information about the times when workers punch a time card in and out
> of their assembly line workstations. Because this is always a local time it
> is then not needed to store the timezone part"
>
> - I think this is the kind of timestamp that is currently implemented in
> Impala (please correct me if I am wrong) and in my opinion it should be
> mapped to something like the Java type java.time.LocalDateTime; the Java 8
> API for java.time.LocalDateTime has this sentence that I think provides a
> good insight on the meaning of this data type:
>
>         "This class does not store or represent a time-zone. Instead, it
> is a description of the date, as used for birthdays, combined with the
> local time as seen on a wall clock. It cannot represent an instant on the
> time-line without additional information such as an offset or time-zone."
>
> - Also this timestamp (again in my opinion) should not be mapped to the
> Kudu type UNIXTIME_MICROS, because their meaning is different (and this is
> what triggered my initial confusion, and I suspect I am not the only one
> out there, who misunderstood this difference)
>
>
> - For the second data type ('timestamp with time zone') I couldn't find a
> good definition in the Oracle document, however this sentence from the Java
> API for java.util.Date (of which java.sql.Timestamp is a subclass) offers a
> good insight:
>
>         "The class Date represents a specific instant in time"
>
> - This is also indirectly implied by the following sentence in the Oracle
> document:
>
>         "Two TIMESTAMP WITH TIME ZONE values are considered identical if
> they represent the same instant in UTC, regardless of the TIME ZONE offsets
> stored in the data."
>
> - The 'timestamp with time zone' is what I think should be mapped to the
> Java type 'java.sql.Timezone' and what also corresponds to the actual
> meaning of the Kudu type UNIXTIME_MICROS
>
> - This is the kind of timestamp type that is most useful to businesses
> (financial, healthcase, security logs) because it captures the "absolute"
> moment in time a transaction or an event occurred, regardless of which
> timezone it occurred, or if it was daylight saving time or not
>
>
> In conclusion the point I am at right now is that to really fix the issue
> in Impala, two things should occur:
>
> - a new data type ('timestamp with time zone'), with the features
> mentioned above, should be implemented - this change would hopefully not
> break any of the existing applications
>
> - the current 'timestamp' type should be mapped to the Java class
> java.time.LocalDateTime (not java.sql.Timestamp) - this would definitely
> break existing applications, and therefore should be postponed to Impala 3.0
>
>
> Regards,
> Franco
>
>
> ------------------------------
> *From: *"Jeszy" <je...@gmail.com>
> *To: *user@impala.apache.org
> *Sent: *Monday, December 18, 2017 7:49:52 AM
> *Subject: *Re: Impala, Kudu, and timestamps (and UNIXTIME_MICROS...)
>
> Hello Franco,
>
> Thanks for your feedback! I agree there are pain points with using
> timestamps, especially together with other systems.
> Is there any particular approach or solution you propose that would
> work well for you? Have you found any jiras on issues.apache.org that
> describe what you're asking for? Commenting on a jira will help the
> team track your input better.
>
> Regards,
> Balazs
>
> On 17 December 2017 at 00:38, Franco Venturi <fv...@comcast.net> wrote:
> > Please note that the discussion below refers to the following versions:
> >       - Impala: v2.10.0-cdh5.13.0
> >       - Kudu: 1.5.0-cdh5.13.0
> >       - Everything runs on a standard Cloudera 5.13 installation
> >
> >
> > A few days ago I was writing some Java code to migrate several tables
> > directly from Oracle to Kudu (to be queried later on by our developers
> and
> > BI tools using Impala). Most of these tables have columns that are of
> type
> > "timestamp" (to be exact, they come in as instances of class
> > oracle.sql.TIMESTAMP and I cast them to java.sql.Timestamp; for the rest
> of
> > this discussion I'll assume we only deal with objects of
> java.sql.Timestamp,
> > to make things simple).
> > As you probably know, Kudu, starting I think with version 1.3.1, has a
> type
> > called 'UNIXTIME_MICROS') and that type gets mapped by Impala as "Impala
> > TIMESTAMP" data type
> > (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_
> timestamp.html).
> >
> >
> > A good description of the meaning of 'UNIXTIME_MICROS' in Kudu is in the
> > 'Apache Kudu Schema Design' document
> > (https://kudu.apache.org/docs/schema_design.html), which says:
> >
> >
> >       unixtime_micros (64-bit microseconds since the Unix epoch)
> >
> >
> > where the 'Unix epoch' is defined as 1/1/1970 00:00:00 GMT.
> >
> >
> > With this understanding I went ahead and wrote my Java code; when I ran
> the
> > first few tests, I noticed that the timestamp values returned by Impala
> (I
> > created in Impala an 'external' table 'stored as kudu') were off by
> several
> > hours compared to the values returned by the original table in Oracle
> (our
> > servers, both the Oracle ones and the Impala/Kudu ones, are all
> configured
> > in the 'America/New_York' timezone).
> >
> >
> > To investigate this difference, I created a simple table in Kudu with
> just
> > two columns, an INT64 as the primary key and a UNIXTIME_MICROS as a
> > timestamp. I ran a few inserts and selects over this table in Impala and
> > figured out that Impala stores a value that is more or less defined as
> > follow:
> >
> >
> >       number of microseconds since the Unix epoch (i.e. what I was
> expecting
> > originally)
> >       + offset of the timestamp I inserted with respect to GMT (in my
> case
> > this offset is the offset for EST or EDT depending if that timestamp was
> > during EST (winter) or EDT (summer))
> >
> >
> > This is how Impala achieves what is described as:
> >
> >
> >       Impala does not store timestamps using the local timezone, to avoid
> > undesired results from unexpected time zone issues
> >
> >
> > That same page has caveats like the following, that sent a shiver down my
> > spine:
> >
> >
> >       If that value was written to a data file, and shipped off to a
> distant
> > server to be analyzed alongside other data from far-flung locations, the
> > dates and times would not match up precisely because of time zone
> > differences
> >
> >
> > This means that if anyone is using (or even thinking about using) "Impala
> > timestamps" to say store financial or health services (or security)
> events,
> > they'll find some nasty "surprises" (even if they don't plan to ever move
> > their servers and only do business in one timezone).
> >
> >
> > Consider for instance the case of anything that occurred between 1am and
> 2am
> > EDT on 11/5/2017 (i.e. in the hour before we moved our clocks back from
> EDT
> > to EST) - there's no way to store the timestamps for these events in Kudu
> > via Impala.
> >
> > To prove this I wrote this simple piece of Java code (which uses Java 8
> and
> > all well documented and non-deprecated classes and methods) to do just an
> > insert and a select via Impala JDBC of a timestamp row in the simple
> table
> > that I mentioned above (primary key + timestamp column):
> >
> >
> >
> >       // run insert
> >       long primaryKey = 1L;
> >       PreparedStatement insert = connection.prepareStatement("insert
> into "
> > + table + " values (?, ?)");
> >       insert.setLong(1, primaryKey);
> >       Timestamp timestampIn = new Timestamp(1509859800000L);
> >       System.out.println("TimestampIn: " + timestampIn + " - getTime():
> " +
> > timestampIn.getTime());
> >       insert.setTimestamp(2, timestampIn);
> >       insert.executeUpdate();
> >       insert.close();
> >
> >
> >       // run select
> >       PreparedStatement select = connection.prepareStatement("select " +
> > timestampColumn + " from " + table + " where " + primaryKeyColumn +
> "=?");
> >       select.setLong(1, primaryKey);
> >       ResultSet resultSet = select.executeQuery();
> >       while (resultSet.next()) {
> >           Timestamp timestampOut = resultSet.getTimestamp(1);
> >           System.out.println("TimestampOut: " + timestampOut + " -
> > getTime(): " + timestampOut.getTime());
> >       }
> >       resultSet.close();
> >       select.close();
> >
> >
> >
> > and this is the output:
> >
> >
> >      TimestampIn: 2017-11-05 01:30:00.0 - getTime(): 1509859800000
> >      TimestampOut: 2017-11-05 01:30:00.0 - getTime(): 1509863400000
> >
> >
> >
> > If this kind of timestamp had been designed say in 1972, I would have
> > probably been OK with it, but in 2017, when people and companies do
> business
> > 24/7, I find this approach (and sentences like "dates and times would not
> > match up precisely") unacceptable.
> >
> >
> >
> > To its defense, the Impala TIMESTAMP web page mentioned above spends
> several
> > paragraphs about how to deal with "UTC timestamps" and problems like the
> > ones shown above, but it basically has the user wrap everything in Impala
> > functions like 'to_utc_timestamp()' and 'from_utc_timestamp()'.
> >
> >
> > This proposed solution - besides the performance impact of say having a
> > query like this:
> >
> >
> >       select ... where to_utc_timestamp(timestamp_column_a) <
> > to_utc_timestamp(timestamp_column_b)
> >
> >
> > makes also much harder to write queries having timestamp fields (which is
> > very common in my workplace) or to use it with automated BI tools.
> >
> >
> >
> > It is definitely not what one would expect from an analytic MPP database,
> > that aspires to be used at the Enterprise level in 2017.
> >
> >
> >
> > Regards,
> > Franco Venturi
>
>

Re: Impala, Kudu, and timestamps (and UNIXTIME_MICROS...)

Posted by Franco Venturi <fv...@comcast.net>.

Thanks for your input Belazs. 
I am still pondering about this issue and I found very useful to follow your advice on seeing what's already in JIRA and doing some research on what others are doing. 

I didn't find anything that exactly describes my problem on the Impala JIRA (I might have missed it), however I did find a closely related discussion that is occurring in the Hive JIRA: I am referring to JIRAs HIVE-14305 (https://issues.apache.org/jira/browse/HIVE-14305), HIVE-14412 (https://issues.apache.org/jira/browse/HIVE-14412), and HIVE-16418 (https://issues.apache.org/jira/browse/HIVE-16418). 

I found also useful to read how two of the most used RDBMS are dealing with timestamps: 
- Oracle (see: https://docs.oracle.com/en/database/oracle/oracle-database/12.2/nlspg/datetime-data-types-and-time-zone-support.html) 
- PostgreSQL (see: https://www.postgresql.org/docs/10/static/datatype-datetime.html) 

Finally I am going to refer to the Java 8 API as described here: https://docs.oracle.com/javase/8/docs/api/overview-summary.html; the reason is that a lot of Impala users (me included) connect to it via Java/JDBC and it is very important in my opinion that the Impala data types are correctly mapped to the correct Java classes to avoid confusion. 


After this long preamble here are some notes that hopefully will help in this discussion. 

- At the center of this issue is that there are two significantly different data types that we commonly refer to when we use the word 'timestamp': 
- timestamp without time zone (most commonly referred to as just 'timestamp') 
- timestamp with time zone 

- These two types are like apples and oranges in my opinion; getting confused between the two (like I did) can cause a lot of frustration (my first post in this thread shows that pretty well..) 

- The first data type ('timestamp without time zone' or just 'timestamp') is best described in this sentence in section 4.2.1.6 (Choosing a TIMESTAMP Data Type) in the Oracle document: 

"Use the TIMESTAMP data type when you need a datetime value to record the time of an event without the time zone. For example, you can store information about the times when workers punch a time card in and out of their assembly line workstations. Because this is always a local time it is then not needed to store the timezone part" 

- I think this is the kind of timestamp that is currently implemented in Impala (please correct me if I am wrong) and in my opinion it should be mapped to something like the Java type java.time.LocalDateTime; the Java 8 API for java.time.LocalDateTime has this sentence that I think provides a good insight on the meaning of this data type: 

"This class does not store or represent a time-zone. Instead, it is a description of the date, as used for birthdays, combined with the local time as seen on a wall clock. It cannot represent an instant on the time-line without additional information such as an offset or time-zone." 

- Also this timestamp (again in my opinion) should not be mapped to the Kudu type UNIXTIME_MICROS, because their meaning is different (and this is what triggered my initial confusion, and I suspect I am not the only one out there, who misunderstood this difference) 


- For the second data type ('timestamp with time zone') I couldn't find a good definition in the Oracle document, however this sentence from the Java API for java.util.Date (of which java.sql.Timestamp is a subclass) offers a good insight: 

"The class Date represents a specific instant in time" 

- This is also indirectly implied by the following sentence in the Oracle document: 

"Two TIMESTAMP WITH TIME ZONE values are considered identical if they represent the same instant in UTC, regardless of the TIME ZONE offsets stored in the data." 

- The 'timestamp with time zone' is what I think should be mapped to the Java type 'java.sql.Timezone' and what also corresponds to the actual meaning of the Kudu type UNIXTIME_MICROS 

- This is the kind of timestamp type that is most useful to businesses (financial, healthcase, security logs) because it captures the "absolute" moment in time a transaction or an event occurred, regardless of which timezone it occurred, or if it was daylight saving time or not 


In conclusion the point I am at right now is that to really fix the issue in Impala, two things should occur: 

- a new data type ('timestamp with time zone'), with the features mentioned above, should be implemented - this change would hopefully not break any of the existing applications 

- the current 'timestamp' type should be mapped to the Java class java.time.LocalDateTime (not java.sql.Timestamp) - this would definitely break existing applications, and therefore should be postponed to Impala 3.0 


Regards, 
Franco 


----- Original Message -----

From: "Jeszy" <je...@gmail.com> 
To: user@impala.apache.org 
Sent: Monday, December 18, 2017 7:49:52 AM 
Subject: Re: Impala, Kudu, and timestamps (and UNIXTIME_MICROS...) 

Hello Franco, 

Thanks for your feedback! I agree there are pain points with using 
timestamps, especially together with other systems. 
Is there any particular approach or solution you propose that would 
work well for you? Have you found any jiras on issues.apache.org that 
describe what you're asking for? Commenting on a jira will help the 
team track your input better. 

Regards, 
Balazs 

On 17 December 2017 at 00:38, Franco Venturi <fv...@comcast.net> wrote: 
> Please note that the discussion below refers to the following versions: 
> - Impala: v2.10.0-cdh5.13.0 
> - Kudu: 1.5.0-cdh5.13.0 
> - Everything runs on a standard Cloudera 5.13 installation 
> 
> 
> A few days ago I was writing some Java code to migrate several tables 
> directly from Oracle to Kudu (to be queried later on by our developers and 
> BI tools using Impala). Most of these tables have columns that are of type 
> "timestamp" (to be exact, they come in as instances of class 
> oracle.sql.TIMESTAMP and I cast them to java.sql.Timestamp; for the rest of 
> this discussion I'll assume we only deal with objects of java.sql.Timestamp, 
> to make things simple). 
> As you probably know, Kudu, starting I think with version 1.3.1, has a type 
> called 'UNIXTIME_MICROS') and that type gets mapped by Impala as "Impala 
> TIMESTAMP" data type 
> (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_timestamp.html). 
> 
> 
> A good description of the meaning of 'UNIXTIME_MICROS' in Kudu is in the 
> 'Apache Kudu Schema Design' document 
> (https://kudu.apache.org/docs/schema_design.html), which says: 
> 
> 
> unixtime_micros (64-bit microseconds since the Unix epoch) 
> 
> 
> where the 'Unix epoch' is defined as 1/1/1970 00:00:00 GMT. 
> 
> 
> With this understanding I went ahead and wrote my Java code; when I ran the 
> first few tests, I noticed that the timestamp values returned by Impala (I 
> created in Impala an 'external' table 'stored as kudu') were off by several 
> hours compared to the values returned by the original table in Oracle (our 
> servers, both the Oracle ones and the Impala/Kudu ones, are all configured 
> in the 'America/New_York' timezone). 
> 
> 
> To investigate this difference, I created a simple table in Kudu with just 
> two columns, an INT64 as the primary key and a UNIXTIME_MICROS as a 
> timestamp. I ran a few inserts and selects over this table in Impala and 
> figured out that Impala stores a value that is more or less defined as 
> follow: 
> 
> 
> number of microseconds since the Unix epoch (i.e. what I was expecting 
> originally) 
> + offset of the timestamp I inserted with respect to GMT (in my case 
> this offset is the offset for EST or EDT depending if that timestamp was 
> during EST (winter) or EDT (summer)) 
> 
> 
> This is how Impala achieves what is described as: 
> 
> 
> Impala does not store timestamps using the local timezone, to avoid 
> undesired results from unexpected time zone issues 
> 
> 
> That same page has caveats like the following, that sent a shiver down my 
> spine: 
> 
> 
> If that value was written to a data file, and shipped off to a distant 
> server to be analyzed alongside other data from far-flung locations, the 
> dates and times would not match up precisely because of time zone 
> differences 
> 
> 
> This means that if anyone is using (or even thinking about using) "Impala 
> timestamps" to say store financial or health services (or security) events, 
> they'll find some nasty "surprises" (even if they don't plan to ever move 
> their servers and only do business in one timezone). 
> 
> 
> Consider for instance the case of anything that occurred between 1am and 2am 
> EDT on 11/5/2017 (i.e. in the hour before we moved our clocks back from EDT 
> to EST) - there's no way to store the timestamps for these events in Kudu 
> via Impala. 
> 
> To prove this I wrote this simple piece of Java code (which uses Java 8 and 
> all well documented and non-deprecated classes and methods) to do just an 
> insert and a select via Impala JDBC of a timestamp row in the simple table 
> that I mentioned above (primary key + timestamp column): 
> 
> 
> 
> // run insert 
> long primaryKey = 1L; 
> PreparedStatement insert = connection.prepareStatement("insert into " 
> + table + " values (?, ?)"); 
> insert.setLong(1, primaryKey); 
> Timestamp timestampIn = new Timestamp(1509859800000L); 
> System.out.println("TimestampIn: " + timestampIn + " - getTime(): " + 
> timestampIn.getTime()); 
> insert.setTimestamp(2, timestampIn); 
> insert.executeUpdate(); 
> insert.close(); 
> 
> 
> // run select 
> PreparedStatement select = connection.prepareStatement("select " + 
> timestampColumn + " from " + table + " where " + primaryKeyColumn + "=?"); 
> select.setLong(1, primaryKey); 
> ResultSet resultSet = select.executeQuery(); 
> while (resultSet.next()) { 
> Timestamp timestampOut = resultSet.getTimestamp(1); 
> System.out.println("TimestampOut: " + timestampOut + " - 
> getTime(): " + timestampOut.getTime()); 
> } 
> resultSet.close(); 
> select.close(); 
> 
> 
> 
> and this is the output: 
> 
> 
> TimestampIn: 2017-11-05 01:30:00.0 - getTime(): 1509859800000 
> TimestampOut: 2017-11-05 01:30:00.0 - getTime(): 1509863400000 
> 
> 
> 
> If this kind of timestamp had been designed say in 1972, I would have 
> probably been OK with it, but in 2017, when people and companies do business 
> 24/7, I find this approach (and sentences like "dates and times would not 
> match up precisely") unacceptable. 
> 
> 
> 
> To its defense, the Impala TIMESTAMP web page mentioned above spends several 
> paragraphs about how to deal with "UTC timestamps" and problems like the 
> ones shown above, but it basically has the user wrap everything in Impala 
> functions like 'to_utc_timestamp()' and 'from_utc_timestamp()'. 
> 
> 
> This proposed solution - besides the performance impact of say having a 
> query like this: 
> 
> 
> select ... where to_utc_timestamp(timestamp_column_a) < 
> to_utc_timestamp(timestamp_column_b) 
> 
> 
> makes also much harder to write queries having timestamp fields (which is 
> very common in my workplace) or to use it with automated BI tools. 
> 
> 
> 
> It is definitely not what one would expect from an analytic MPP database, 
> that aspires to be used at the Enterprise level in 2017. 
> 
> 
> 
> Regards, 
> Franco Venturi

Re: Impala, Kudu, and timestamps (and UNIXTIME_MICROS...)

Posted by Franco Venturi <fv...@comcast.net>.

Greg, 
if you are somewhere on the West Coast (i.e. if your servers are configured to run on the 'America/Los_Angeles' timezone), do you mind changing this line from: 

Timestamp timestampIn = new Timestamp(1509859800000L); 

to: 

Timestamp timestampIn = new Timestamp(1509870600000L); 

and see if it make any difference? 

Also I see you are running a different version of Impalad (2.9.0-cdh5.12.1), while I am on v2.10.0-cdh5.13.0. 

On the other hand, I'll give it a try with the latest JDBC driver provider by Cloudera and see how it goes. 

Franco 


----- Original Message -----

From: "greg rahn" <gr...@gmail.com> 
To: user@impala.apache.org 
Sent: Monday, December 18, 2017 6:46:44 PM 
Subject: Re: Impala, Kudu, and timestamps (and UNIXTIME_MICROS...) 

It seems to work as expected using the latest version of the Impala JDBC driver. 
Might you give that a try and see if it solves your issue? 

create table t (pk int primary key, ts timestamp) 
partition by hash(pk) partitions 2 
stored as kudu 
tblproperties('kudu.num_tablet_replicas' = '1'); 

*** Database product version: 2.9.0-cdh5.12.1 
*** JDBC Driver version: 02.05.41.1061 
TimestampIn: 2017-11-04 22:30:00.0 - getTime(): 1509859800000 
TimestampOut: 2017-11-04 22:30:00.0 - getTime(): 1509859800000 

On 2017-12-18 04:49, Jeszy <je...@gmail.com> wrote: 
> Hello Franco, 
> 
> Thanks for your feedback! I agree there are pain points with using 
> timestamps, especially together with other systems. 
> Is there any particular approach or solution you propose that would 
> work well for you? Have you found any jiras on issues.apache.org that 
> describe what you're asking for? Commenting on a jira will help the 
> team track your input better. 
> 
> Regards, 
> Balazs 
> 
> On 17 December 2017 at 00:38, Franco Venturi <fv...@comcast.net> wrote: 
> > Please note that the discussion below refers to the following versions: 
> > - Impala: v2.10.0-cdh5.13.0 
> > - Kudu: 1.5.0-cdh5.13.0 
> > - Everything runs on a standard Cloudera 5.13 installation 
> > 
> > 
> > A few days ago I was writing some Java code to migrate several tables 
> > directly from Oracle to Kudu (to be queried later on by our developers and 
> > BI tools using Impala). Most of these tables have columns that are of type 
> > "timestamp" (to be exact, they come in as instances of class 
> > oracle.sql.TIMESTAMP and I cast them to java.sql.Timestamp; for the rest of 
> > this discussion I'll assume we only deal with objects of java.sql.Timestamp, 
> > to make things simple). 
> > As you probably know, Kudu, starting I think with version 1.3.1, has a type 
> > called 'UNIXTIME_MICROS') and that type gets mapped by Impala as "Impala 
> > TIMESTAMP" data type 
> > (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_timestamp.html). 
> > 
> > 
> > A good description of the meaning of 'UNIXTIME_MICROS' in Kudu is in the 
> > 'Apache Kudu Schema Design' document 
> > (https://kudu.apache.org/docs/schema_design.html), which says: 
> > 
> > 
> > unixtime_micros (64-bit microseconds since the Unix epoch) 
> > 
> > 
> > where the 'Unix epoch' is defined as 1/1/1970 00:00:00 GMT. 
> > 
> > 
> > With this understanding I went ahead and wrote my Java code; when I ran the 
> > first few tests, I noticed that the timestamp values returned by Impala (I 
> > created in Impala an 'external' table 'stored as kudu') were off by several 
> > hours compared to the values returned by the original table in Oracle (our 
> > servers, both the Oracle ones and the Impala/Kudu ones, are all configured 
> > in the 'America/New_York' timezone). 
> > 
> > 
> > To investigate this difference, I created a simple table in Kudu with just 
> > two columns, an INT64 as the primary key and a UNIXTIME_MICROS as a 
> > timestamp. I ran a few inserts and selects over this table in Impala and 
> > figured out that Impala stores a value that is more or less defined as 
> > follow: 
> > 
> > 
> > number of microseconds since the Unix epoch (i.e. what I was expecting 
> > originally) 
> > + offset of the timestamp I inserted with respect to GMT (in my case 
> > this offset is the offset for EST or EDT depending if that timestamp was 
> > during EST (winter) or EDT (summer)) 
> > 
> > 
> > This is how Impala achieves what is described as: 
> > 
> > 
> > Impala does not store timestamps using the local timezone, to avoid 
> > undesired results from unexpected time zone issues 
> > 
> > 
> > That same page has caveats like the following, that sent a shiver down my 
> > spine: 
> > 
> > 
> > If that value was written to a data file, and shipped off to a distant 
> > server to be analyzed alongside other data from far-flung locations, the 
> > dates and times would not match up precisely because of time zone 
> > differences 
> > 
> > 
> > This means that if anyone is using (or even thinking about using) "Impala 
> > timestamps" to say store financial or health services (or security) events, 
> > they'll find some nasty "surprises" (even if they don't plan to ever move 
> > their servers and only do business in one timezone). 
> > 
> > 
> > Consider for instance the case of anything that occurred between 1am and 2am 
> > EDT on 11/5/2017 (i.e. in the hour before we moved our clocks back from EDT 
> > to EST) - there's no way to store the timestamps for these events in Kudu 
> > via Impala. 
> > 
> > To prove this I wrote this simple piece of Java code (which uses Java 8 and 
> > all well documented and non-deprecated classes and methods) to do just an 
> > insert and a select via Impala JDBC of a timestamp row in the simple table 
> > that I mentioned above (primary key + timestamp column): 
> > 
> > 
> > 
> > // run insert 
> > long primaryKey = 1L; 
> > PreparedStatement insert = connection.prepareStatement("insert into " 
> > + table + " values (?, ?)"); 
> > insert.setLong(1, primaryKey); 
> > Timestamp timestampIn = new Timestamp(1509859800000L); 
> > System.out.println("TimestampIn: " + timestampIn + " - getTime(): " + 
> > timestampIn.getTime()); 
> > insert.setTimestamp(2, timestampIn); 
> > insert.executeUpdate(); 
> > insert.close(); 
> > 
> > 
> > // run select 
> > PreparedStatement select = connection.prepareStatement("select " + 
> > timestampColumn + " from " + table + " where " + primaryKeyColumn + "=?"); 
> > select.setLong(1, primaryKey); 
> > ResultSet resultSet = select.executeQuery(); 
> > while (resultSet.next()) { 
> > Timestamp timestampOut = resultSet.getTimestamp(1); 
> > System.out.println("TimestampOut: " + timestampOut + " - 
> > getTime(): " + timestampOut.getTime()); 
> > } 
> > resultSet.close(); 
> > select.close(); 
> > 
> > 
> > 
> > and this is the output: 
> > 
> > 
> > TimestampIn: 2017-11-05 01:30:00.0 - getTime(): 1509859800000 
> > TimestampOut: 2017-11-05 01:30:00.0 - getTime(): 1509863400000 
> > 
> > 
> > 
> > If this kind of timestamp had been designed say in 1972, I would have 
> > probably been OK with it, but in 2017, when people and companies do business 
> > 24/7, I find this approach (and sentences like "dates and times would not 
> > match up precisely") unacceptable. 
> > 
> > 
> > 
> > To its defense, the Impala TIMESTAMP web page mentioned above spends several 
> > paragraphs about how to deal with "UTC timestamps" and problems like the 
> > ones shown above, but it basically has the user wrap everything in Impala 
> > functions like 'to_utc_timestamp()' and 'from_utc_timestamp()'. 
> > 
> > 
> > This proposed solution - besides the performance impact of say having a 
> > query like this: 
> > 
> > 
> > select ... where to_utc_timestamp(timestamp_column_a) < 
> > to_utc_timestamp(timestamp_column_b) 
> > 
> > 
> > makes also much harder to write queries having timestamp fields (which is 
> > very common in my workplace) or to use it with automated BI tools. 
> > 
> > 
> > 
> > It is definitely not what one would expect from an analytic MPP database, 
> > that aspires to be used at the Enterprise level in 2017. 
> > 
> > 
> > 
> > Regards, 
> > Franco Venturi 
>

Re: Impala, Kudu, and timestamps (and UNIXTIME_MICROS...)

Posted by "greg.rahn@gmail.com" <gr...@gmail.com>.

It seems to work as expected using the latest version of the Impala JDBC driver.  
Might you give that a try and see if it solves your issue?

create table t (pk int primary key, ts timestamp) 
partition by hash(pk) partitions 2 
stored as kudu
tblproperties('kudu.num_tablet_replicas' = '1');

*** Database product version: 2.9.0-cdh5.12.1
*** JDBC Driver version: 02.05.41.1061
TimestampIn:  2017-11-04 22:30:00.0 - getTime(): 1509859800000
TimestampOut: 2017-11-04 22:30:00.0 - getTime(): 1509859800000

On 2017-12-18 04:49, Jeszy <je...@gmail.com> wrote: 
> Hello Franco,
> 
> Thanks for your feedback! I agree there are pain points with using
> timestamps, especially together with other systems.
> Is there any particular approach or solution you propose that would
> work well for you? Have you found any jiras on issues.apache.org that
> describe what you're asking for? Commenting on a jira will help the
> team track your input better.
> 
> Regards,
> Balazs
> 
> On 17 December 2017 at 00:38, Franco Venturi <fv...@comcast.net> wrote:
> > Please note that the discussion below refers to the following versions:
> >       - Impala: v2.10.0-cdh5.13.0
> >       - Kudu: 1.5.0-cdh5.13.0
> >       - Everything runs on a standard Cloudera 5.13 installation
> >
> >
> > A few days ago I was writing some Java code to migrate several tables
> > directly from Oracle to Kudu (to be queried later on by our developers and
> > BI tools using Impala). Most of these tables have columns that are of type
> > "timestamp" (to be exact, they come in as instances of class
> > oracle.sql.TIMESTAMP and I cast them to java.sql.Timestamp; for the rest of
> > this discussion I'll assume we only deal with objects of java.sql.Timestamp,
> > to make things simple).
> > As you probably know, Kudu, starting I think with version 1.3.1, has a type
> > called 'UNIXTIME_MICROS') and that type gets mapped by Impala as "Impala
> > TIMESTAMP" data type
> > (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_timestamp.html).
> >
> >
> > A good description of the meaning of 'UNIXTIME_MICROS' in Kudu is in the
> > 'Apache Kudu Schema Design' document
> > (https://kudu.apache.org/docs/schema_design.html), which says:
> >
> >
> >       unixtime_micros (64-bit microseconds since the Unix epoch)
> >
> >
> > where the 'Unix epoch' is defined as 1/1/1970 00:00:00 GMT.
> >
> >
> > With this understanding I went ahead and wrote my Java code; when I ran the
> > first few tests, I noticed that the timestamp values returned by Impala (I
> > created in Impala an 'external' table 'stored as kudu') were off by several
> > hours compared to the values returned by the original table in Oracle (our
> > servers, both the Oracle ones and the Impala/Kudu ones, are all configured
> > in the 'America/New_York' timezone).
> >
> >
> > To investigate this difference, I created a simple table in Kudu with just
> > two columns, an INT64 as the primary key and a UNIXTIME_MICROS as a
> > timestamp. I ran a few inserts and selects over this table in Impala and
> > figured out that Impala stores a value that is more or less defined as
> > follow:
> >
> >
> >       number of microseconds since the Unix epoch (i.e. what I was expecting
> > originally)
> >       + offset of the timestamp I inserted with respect to GMT (in my case
> > this offset is the offset for EST or EDT depending if that timestamp was
> > during EST (winter) or EDT (summer))
> >
> >
> > This is how Impala achieves what is described as:
> >
> >
> >       Impala does not store timestamps using the local timezone, to avoid
> > undesired results from unexpected time zone issues
> >
> >
> > That same page has caveats like the following, that sent a shiver down my
> > spine:
> >
> >
> >       If that value was written to a data file, and shipped off to a distant
> > server to be analyzed alongside other data from far-flung locations, the
> > dates and times would not match up precisely because of time zone
> > differences
> >
> >
> > This means that if anyone is using (or even thinking about using) "Impala
> > timestamps" to say store financial or health services (or security) events,
> > they'll find some nasty "surprises" (even if they don't plan to ever move
> > their servers and only do business in one timezone).
> >
> >
> > Consider for instance the case of anything that occurred between 1am and 2am
> > EDT on 11/5/2017 (i.e. in the hour before we moved our clocks back from EDT
> > to EST) - there's no way to store the timestamps for these events in Kudu
> > via Impala.
> >
> > To prove this I wrote this simple piece of Java code (which uses Java 8 and
> > all well documented and non-deprecated classes and methods) to do just an
> > insert and a select via Impala JDBC of a timestamp row in the simple table
> > that I mentioned above (primary key + timestamp column):
> >
> >
> >
> >       // run insert
> >       long primaryKey = 1L;
> >       PreparedStatement insert = connection.prepareStatement("insert into "
> > + table + " values (?, ?)");
> >       insert.setLong(1, primaryKey);
> >       Timestamp timestampIn = new Timestamp(1509859800000L);
> >       System.out.println("TimestampIn: " + timestampIn + " - getTime(): " +
> > timestampIn.getTime());
> >       insert.setTimestamp(2, timestampIn);
> >       insert.executeUpdate();
> >       insert.close();
> >
> >
> >       // run select
> >       PreparedStatement select = connection.prepareStatement("select " +
> > timestampColumn + " from " + table + " where " + primaryKeyColumn + "=?");
> >       select.setLong(1, primaryKey);
> >       ResultSet resultSet = select.executeQuery();
> >       while (resultSet.next()) {
> >           Timestamp timestampOut = resultSet.getTimestamp(1);
> >           System.out.println("TimestampOut: " + timestampOut + " -
> > getTime(): " + timestampOut.getTime());
> >       }
> >       resultSet.close();
> >       select.close();
> >
> >
> >
> > and this is the output:
> >
> >
> >      TimestampIn: 2017-11-05 01:30:00.0 - getTime(): 1509859800000
> >      TimestampOut: 2017-11-05 01:30:00.0 - getTime(): 1509863400000
> >
> >
> >
> > If this kind of timestamp had been designed say in 1972, I would have
> > probably been OK with it, but in 2017, when people and companies do business
> > 24/7, I find this approach (and sentences like "dates and times would not
> > match up precisely") unacceptable.
> >
> >
> >
> > To its defense, the Impala TIMESTAMP web page mentioned above spends several
> > paragraphs about how to deal with "UTC timestamps" and problems like the
> > ones shown above, but it basically has the user wrap everything in Impala
> > functions like 'to_utc_timestamp()' and 'from_utc_timestamp()'.
> >
> >
> > This proposed solution - besides the performance impact of say having a
> > query like this:
> >
> >
> >       select ... where to_utc_timestamp(timestamp_column_a) <
> > to_utc_timestamp(timestamp_column_b)
> >
> >
> > makes also much harder to write queries having timestamp fields (which is
> > very common in my workplace) or to use it with automated BI tools.
> >
> >
> >
> > It is definitely not what one would expect from an analytic MPP database,
> > that aspires to be used at the Enterprise level in 2017.
> >
> >
> >
> > Regards,
> > Franco Venturi
>

Re: Impala, Kudu, and timestamps (and UNIXTIME_MICROS...)

Posted by Jeszy <je...@gmail.com>.

Hello Franco,

Thanks for your feedback! I agree there are pain points with using
timestamps, especially together with other systems.
Is there any particular approach or solution you propose that would
work well for you? Have you found any jiras on issues.apache.org that
describe what you're asking for? Commenting on a jira will help the
team track your input better.

Regards,
Balazs

On 17 December 2017 at 00:38, Franco Venturi <fv...@comcast.net> wrote:
> Please note that the discussion below refers to the following versions:
>       - Impala: v2.10.0-cdh5.13.0
>       - Kudu: 1.5.0-cdh5.13.0
>       - Everything runs on a standard Cloudera 5.13 installation
>
>
> A few days ago I was writing some Java code to migrate several tables
> directly from Oracle to Kudu (to be queried later on by our developers and
> BI tools using Impala). Most of these tables have columns that are of type
> "timestamp" (to be exact, they come in as instances of class
> oracle.sql.TIMESTAMP and I cast them to java.sql.Timestamp; for the rest of
> this discussion I'll assume we only deal with objects of java.sql.Timestamp,
> to make things simple).
> As you probably know, Kudu, starting I think with version 1.3.1, has a type
> called 'UNIXTIME_MICROS') and that type gets mapped by Impala as "Impala
> TIMESTAMP" data type
> (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_timestamp.html).
>
>
> A good description of the meaning of 'UNIXTIME_MICROS' in Kudu is in the
> 'Apache Kudu Schema Design' document
> (https://kudu.apache.org/docs/schema_design.html), which says:
>
>
>       unixtime_micros (64-bit microseconds since the Unix epoch)
>
>
> where the 'Unix epoch' is defined as 1/1/1970 00:00:00 GMT.
>
>
> With this understanding I went ahead and wrote my Java code; when I ran the
> first few tests, I noticed that the timestamp values returned by Impala (I
> created in Impala an 'external' table 'stored as kudu') were off by several
> hours compared to the values returned by the original table in Oracle (our
> servers, both the Oracle ones and the Impala/Kudu ones, are all configured
> in the 'America/New_York' timezone).
>
>
> To investigate this difference, I created a simple table in Kudu with just
> two columns, an INT64 as the primary key and a UNIXTIME_MICROS as a
> timestamp. I ran a few inserts and selects over this table in Impala and
> figured out that Impala stores a value that is more or less defined as
> follow:
>
>
>       number of microseconds since the Unix epoch (i.e. what I was expecting
> originally)
>       + offset of the timestamp I inserted with respect to GMT (in my case
> this offset is the offset for EST or EDT depending if that timestamp was
> during EST (winter) or EDT (summer))
>
>
> This is how Impala achieves what is described as:
>
>
>       Impala does not store timestamps using the local timezone, to avoid
> undesired results from unexpected time zone issues
>
>
> That same page has caveats like the following, that sent a shiver down my
> spine:
>
>
>       If that value was written to a data file, and shipped off to a distant
> server to be analyzed alongside other data from far-flung locations, the
> dates and times would not match up precisely because of time zone
> differences
>
>
> This means that if anyone is using (or even thinking about using) "Impala
> timestamps" to say store financial or health services (or security) events,
> they'll find some nasty "surprises" (even if they don't plan to ever move
> their servers and only do business in one timezone).
>
>
> Consider for instance the case of anything that occurred between 1am and 2am
> EDT on 11/5/2017 (i.e. in the hour before we moved our clocks back from EDT
> to EST) - there's no way to store the timestamps for these events in Kudu
> via Impala.
>
> To prove this I wrote this simple piece of Java code (which uses Java 8 and
> all well documented and non-deprecated classes and methods) to do just an
> insert and a select via Impala JDBC of a timestamp row in the simple table
> that I mentioned above (primary key + timestamp column):
>
>
>
>       // run insert
>       long primaryKey = 1L;
>       PreparedStatement insert = connection.prepareStatement("insert into "
> + table + " values (?, ?)");
>       insert.setLong(1, primaryKey);
>       Timestamp timestampIn = new Timestamp(1509859800000L);
>       System.out.println("TimestampIn: " + timestampIn + " - getTime(): " +
> timestampIn.getTime());
>       insert.setTimestamp(2, timestampIn);
>       insert.executeUpdate();
>       insert.close();
>
>
>       // run select
>       PreparedStatement select = connection.prepareStatement("select " +
> timestampColumn + " from " + table + " where " + primaryKeyColumn + "=?");
>       select.setLong(1, primaryKey);
>       ResultSet resultSet = select.executeQuery();
>       while (resultSet.next()) {
>           Timestamp timestampOut = resultSet.getTimestamp(1);
>           System.out.println("TimestampOut: " + timestampOut + " -
> getTime(): " + timestampOut.getTime());
>       }
>       resultSet.close();
>       select.close();
>
>
>
> and this is the output:
>
>
>      TimestampIn: 2017-11-05 01:30:00.0 - getTime(): 1509859800000
>      TimestampOut: 2017-11-05 01:30:00.0 - getTime(): 1509863400000
>
>
>
> If this kind of timestamp had been designed say in 1972, I would have
> probably been OK with it, but in 2017, when people and companies do business
> 24/7, I find this approach (and sentences like "dates and times would not
> match up precisely") unacceptable.
>
>
>
> To its defense, the Impala TIMESTAMP web page mentioned above spends several
> paragraphs about how to deal with "UTC timestamps" and problems like the
> ones shown above, but it basically has the user wrap everything in Impala
> functions like 'to_utc_timestamp()' and 'from_utc_timestamp()'.
>
>
> This proposed solution - besides the performance impact of say having a
> query like this:
>
>
>       select ... where to_utc_timestamp(timestamp_column_a) <
> to_utc_timestamp(timestamp_column_b)
>
>
> makes also much harder to write queries having timestamp fields (which is
> very common in my workplace) or to use it with automated BI tools.
>
>
>
> It is definitely not what one would expect from an analytic MPP database,
> that aspires to be used at the Enterprise level in 2017.
>
>
>
> Regards,
> Franco Venturi