Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/25 21:33:30 UTC

[GitHub] [arrow] rok commented on a change in pull request #10997: ARROW-13218: [Format] Clarify interpretation of timestamp values

rok commented on a change in pull request #10997:
URL: https://github.com/apache/arrow/pull/10997#discussion_r696129093



##########
File path: format/Schema.fbs
##########
@@ -214,58 +214,123 @@ table Time {
   bitWidth: int = 32;
 }
 
-/// Time elapsed from the Unix epoch, 00:00:00.000 on 1 January 1970, excluding
-/// leap seconds, as a 64-bit integer. Note that UNIX time does not include
-/// leap seconds.
+/// Timestamp is a 64-bit signed integer representing an elapsed time since a
+/// fixed epoch, stored in either of four units: seconds, milliseconds,
+/// microseconds or nanoseconds, and is optionally annotated with a timezone.
+///
+/// Timestamp values do not include any leap seconds (in other words, all
+/// days are considered 86400 seconds long).
+///
+/// Timestamps with a non-empty timezone
+/// ------------------------------------
+///
+/// If a Timestamp column has a non-empty timezone value, its epoch is
+/// 1970-01-01 00:00:00 (January 1st 1970, midnight) in the *UTC* timezone
+/// (the Unix epoch), regardless of the Timestamp's own timezone.
+///
+/// Therefore, timestamp values with a non-empty timezone correspond to
+/// physical points in time together with some additional information about
+/// how the data was obtained and/or how to display it (the timezone).
+///
+///   For example, the timestamp value 0 with the timezone string "Europe/Paris"
+///   corresponds to "January 1st 1970, 00h00" in the UTC timezone, but could
+///   also be displayed as "January 1st 1970, 01h00" in the Europe/Paris timezone
+///   (which is the same physical point in time).
+///
+/// One consequence is that timestamp values with a non-empty timezone
+/// can be compared and ordered directly, since they all share the same
+/// well-known point of reference (the Unix epoch).
+///
+/// Timestamps with an unset / empty timezone
+/// -----------------------------------------
+///
+/// If a Timestamp column has no timezone value, its epoch is
+/// 1970-01-01 00:00:00 (January 1st 1970, midnight) in an *unknown* timezone.
+///
+/// Therefore, timestamp values without a timezone cannot be meaningfully
+/// interpreted as physical points in time, but only as calendar / clock
+/// indications ("wall clock time") in an unspecified timezone.
+///
+///   For example, the timestamp value 0 with an empty timezone string
+///   corresponds to "January 1st 1970, 00h00" in an unknown timezone: there
+///   is not enough information to interpret it as a well-defined physical
+///   point in time.
+///
+/// One consequence is that timestamp values without a timezone cannot
+/// be reliably compared or ordered, since they may have different points of
+/// reference.  In particular, it is *not* possible to interpret an unset
+/// or empty timezone as the same as "UTC".
+///
+/// Conversion between timezones
+/// ----------------------------
+///
+/// If a Timestamp column has a non-empty timezone, changing the timezone
+/// to a different non-empty value is a metadata-only operation:
+/// the timestamp values need not change as their point of reference remains
+/// the same (the Unix epoch).
+///
+/// However, if a Timestamp column has no timezone value, changing it to a
+/// non-empty value requires thinking about the desired semantics.
+/// One possibility is to assume that the original timestamp values are
+/// relative to the epoch of the timezone being set; timestamp values should
+/// then be "unlocalized" by adjusting them to the Unix epoch
+/// (for example, changing the timezone from empty to "Europe/Paris" would
+///  require converting the timestamp values from "Europe/Paris" to "UTC",
+///  which seems counter-intuitive but is nevertheless correct).
+///
+/// Guidelines for encoding data from external libraries
+/// ----------------------------------------------------
 ///
 /// Date & time libraries often have multiple different data types for temporal
-/// data.  In order to ease interoperability between different implementations the
+/// data. In order to ease interoperability between different implementations the
 /// Arrow project has some recommendations for encoding these types into a Timestamp
 /// column.
 ///
-/// An "instant" represents a single moment in time that has no meaningful time zone
-/// or the time zone is unknown.  A column of instants can also contain values from
-/// multiple time zones.  To encode an instant set the timezone string to "UTC".
+/// An "instant" represents a physical point in time that has no relevant time zone
+/// (for example, astronomical data). To encode an instant, use a Timestamp with
+/// the timezone string set to "UTC", and make sure the Timestamp values
+/// are relative to the UTC epoch (January 1st 1970, midnight).
+///
+/// A "zoned date-time" represents a physical point in time annotated with an
+/// informative time zone (for example, the time zone in which the data was
+/// recorded).  To encode a zoned date-time, use a Timestamp with the timezone
+/// string set to the name of the timezone, and make sure the Timestamp values
+/// are relative to the UTC epoch (January 1st 1970, midnight).
 ///
-/// A "zoned date-time" represents a single moment in time that has a meaningful
-/// reference time zone.  To encode a zoned date-time as a Timestamp set the timezone
-/// string to the name of the timezone.  There is some ambiguity between an instant
-/// and a zoned date-time with the UTC time zone.  Both of these are stored the same.
-/// Typically, this distinction does not matter.  If it does, then an application should
-/// use custom metadata or an extension type to distinguish between the two cases.
+///  (There is some ambiguity between an instant and a zoned date-time with the
+///   UTC time zone.  Both of these are stored the same in Arrow.  Typically,
+///   this distinction does not matter.  If it does, then an application should
+///   use custom metadata or an extension type to distinguish between the two cases.)
 ///
-/// An "offset date-time" represents a single moment in time combined with a meaningful
-/// offset from UTC.  To encode an offset date-time as a Timestamp set the timezone string
-/// to the numeric time zone offset string (e.g. "+03:00").
+/// An "offset date-time" represents a physical point in time combined with an
+/// explicit offset from UTC.  To encode an offset date-time, use a Timestamp
+/// with the timezone string set to the numeric time zone offset string
+/// (e.g. "+03:00"), and make sure the Timestamp values are relative to
+/// the UTC epoch (January 1st 1970, midnight).
 ///
-/// A "local date-time" does not represent a single moment in time.  It represents a wall
-/// clock time combined with a date.  Because of daylight savings time there may multiple
-/// instants that correspond to a single local date-time in any given time zone.  A
-/// local date-time is often stored as a struct or a Date32/Time64 pair.  However, it can
-/// also be encoded into a Timestamp column.  To do so the value should be the the time
-/// elapsed from the Unix epoch so that a wall clock in UTC would display the desired time.
-/// The timezone string should be set to null or the empty string.
+/// A "naive date-time" represents a wall clock time combined with a calendar
+/// date, but with no indication of how to map this information to a physical
+/// point in time. Naive date-times must be handled with care because of
+/// this missing information, and also because daylight saving time (DST)
+/// may make some values ambiguous. A naive date-time may be stored as a

Review comment:
```suggestion
/// may make some values non-existent or ambiguous. A naive date-time may be stored as a
```
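
For context, here is a minimal sketch (not part of the PR) of why both cases are worth naming. It assumes Python 3.9+ with the standard `zoneinfo` module and an IANA tz database available on the system; `classify` and `PARIS` are hypothetical names used only for illustration. A naive wall clock time that falls in a DST "spring forward" gap never occurs in that timezone, while one that falls in a "fall back" overlap occurs twice.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Assumption: the IANA tz database is installed (on some platforms this
# requires the third-party "tzdata" package).
PARIS = ZoneInfo("Europe/Paris")


def classify(naive: datetime, tz: ZoneInfo) -> str:
    """Classify a naive wall clock time as 'unique', 'ambiguous' or
    'non-existent' when interpreted in the given timezone."""
    # Interpret the same wall clock reading with both values of `fold`
    # (PEP 495). Outside DST transitions both give the same UTC offset.
    first = naive.replace(tzinfo=tz, fold=0)
    second = naive.replace(tzinfo=tz, fold=1)
    if first.utcoffset() == second.utcoffset():
        return "unique"
    # Inside a transition, round-trip through UTC: a wall clock time that
    # really occurs survives the round trip, a skipped one does not.
    round_trip = first.astimezone(timezone.utc).astimezone(tz)
    if round_trip.replace(tzinfo=None) == naive:
        return "ambiguous"
    return "non-existent"


# 2021-03-28: clocks in Paris jump from 02:00 directly to 03:00.
print(classify(datetime(2021, 3, 28, 2, 30), PARIS))
# 2021-10-31: clocks in Paris go back from 03:00 to 02:00.
print(classify(datetime(2021, 10, 31, 2, 30), PARIS))
# An ordinary summer day, far from any transition.
print(classify(datetime(2021, 7, 1, 12, 0), PARIS))
```

In this sketch the first call falls in the gap, the second in the overlap, and the third maps to exactly one physical point in time, which is the distinction the suggested wording tries to capture.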



