You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/30 09:41:32 UTC

[GitHub] [arrow] westonpace opened a new pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

westonpace opened a new pull request #10629:
URL: https://github.com/apache/arrow/pull/10629


   I've made an attempt to refine the recent discussions into an updated comment describing the timestamp column.  Since the entire discussion has been around fine-grained semantic concepts I will appreciate even minor suggestions to improve the wording.  There are still votes ongoing so this shouldn't be merged until those resolve.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
westonpace commented on pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#issuecomment-879491179


   Any remaining comments or suggestions?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#discussion_r664130256



##########
File path: format/Schema.fbs
##########
@@ -218,8 +218,33 @@ table Time {
 /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
 /// leap seconds.
 ///
-/// The Timestamp metadata supports both "time zone naive" and "time zone
-/// aware" timestamps. Read about the timezone attribute for more detail
+/// Date & time libraries often have multiple different data types for temporal
+/// data.  In order to ease interoperability between different implementations the
+/// Arrow project has some recommendations for encoding these types into a Timestamp
+/// column.
+///
+/// An "Instant" represents a single moment in time that has no meaningful time zone
+/// or the time zone is unknown.  A column of Instants can also contain values from
+/// multiple time zones.  To encode an instant set the timezone string to "UTC".
+///
+/// A "ZonedDateTime" represents a single moment in time that has a meaningful

Review comment:
       I would personally not use CamelClass names for those concepts. That way, it seems to indicate this is a specific Arrow type or class (since this is the Arrow specification), rather than a general description of concepts, which could cause confusion (and it are not generally used names, AFAIK only Java uses exactly those names). 
   
   Describing things in a "neutral" way is hard of course (since everybody comes to this with their own background with specific terminology), but so to avoid bike-shedding on the terms, I would suggest to just not use CameCase. So for example, alternatively could use "zoned date-time" (with a space, so it doesn't mimic a class name).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#discussion_r664130256



##########
File path: format/Schema.fbs
##########
@@ -218,8 +218,33 @@ table Time {
 /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
 /// leap seconds.
 ///
-/// The Timestamp metadata supports both "time zone naive" and "time zone
-/// aware" timestamps. Read about the timezone attribute for more detail
+/// Date & time libraries often have multiple different data types for temporal
+/// data.  In order to ease interoperability between different implementations the
+/// Arrow project has some recommendations for encoding these types into a Timestamp
+/// column.
+///
+/// An "Instant" represents a single moment in time that has no meaningful time zone
+/// or the time zone is unknown.  A column of Instants can also contain values from
+/// multiple time zones.  To encode an instant set the timezone string to "UTC".
+///
+/// A "ZonedDateTime" represents a single moment in time that has a meaningful

Review comment:
       I would personally not use CamelClass names for those concepts. That way, it seems to indicate this is a specific arrow type or class (since this is the arrow specification), rather than a general description, which could cause confusion (and it are not generally used names, AFAIK only Java uses exactly those names). 
   
   Describing things in a "neutral" way is hard of course (since everybody comes to this with their own background with specific terminology), but so to avoid bike-shedding on the terms, I would suggest to just not use CameCase. So for example, alternatively could use "zoned date-time" (with a space, so it doesn't mimic a class name).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#discussion_r662398027



##########
File path: format/Schema.fbs
##########
@@ -218,8 +218,34 @@ table Time {
 /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
 /// leap seconds.
 ///
-/// The Timestamp metadata supports both "time zone naive" and "time zone
-/// aware" timestamps. Read about the timezone attribute for more detail
+/// Date & time libraries often have multiple different data types for temporal
+/// data.  In order to ease interoperability between different implementations the
+/// Arrow project has some recommendations for encoding these types into a Timestamp
+/// column.
+///
+/// An "Instant" represents a single moment in time that has no meaningful time zone
+/// or the time zone is unknown.  A column of Instants can also contain values from
+/// multiple time zones.  To encode an instant set the timezone string to "UTC".
+///
+/// A "ZonedDateTime" represents a single moment in time that has a meaningful
+/// reference time zone.  To encode a ZonedDateTime as a Timestamp set the timezone
+/// string to the name of the timezone.
+///
+/// An "OffsetDateTime" represents a single moment in time combined with a meaningful
+/// offset from UTC.  To encode an OffsetDateTime as a Timestamp set the timezone string
+/// to the numeric time zone offset string (e.g. "+03:00").
+///
+/// A "LocalDateTime" does not represent a single moment in time.  It represents a wall
+/// clock time combined with a date.  Because of daylight savings time there may multiple
+/// Instants that correspond to a single LocalDateTime in any given time zone.  A
+/// LocalDateTime is often stored as a struct or a Date32/Time64 pair.  However, it can
+/// also be encoded into a Timestamp column.  To do so the value should be the the time
+/// elapsed from the Unix epoch so that a wall clock in UTC would display the desired time.
+/// Futhermore, the timezone string should be set to null.

Review comment:
       I think empty string also works.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#discussion_r661505356



##########
File path: format/Schema.fbs
##########
@@ -218,8 +218,34 @@ table Time {
 /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
 /// leap seconds.
 ///
-/// The Timestamp metadata supports both "time zone naive" and "time zone
-/// aware" timestamps. Read about the timezone attribute for more detail
+/// Date & time libraries often have multiple different data types for temporal
+/// data.  In order to ease interoperability between different implementations the
+/// Arrow project has some recommendations for encoding these types into a Timestamp
+/// column.
+///
+/// An "Instant" represents a single moment in time that has no meaningful time zone
+/// or the time zone is unknown.  A column of Instants can also contain values from
+/// multiple time zones.  To encode an instant set the timezone string to "UTC".
+///
+/// A "ZonedDateTime" represents a single moment in time that has a meaningful
+/// reference time zone.  To encode a ZonedDateTime as a Timestamp set the timezone
+/// string to the name of the timezone.
+///
+/// An "OffsetDateTime" represents a single moment in time combined with a meaningful
+/// offset from UTC.  To encode an OffsetDateTime as a Timestamp set the timezone string
+/// to the numeric time zone offset string (e.g. "+03:00").
+///
+/// A "LocalDateTime" does not represent a single moment in time.  It represents a wall
+/// clock time combined with a date.  Because of daylight savings time there may multiple
+/// Instants that correspond to a single LocalDateTime in any given time zone.  A
+/// LocalDateTime is often stored as a struct or a Date32/Time64 pair.  However, it can

Review comment:
       Just to double check, the "is often stored as…" is commenting about systems outside of Arrow, right? It might be useful to clarify that the "it can be also be encoded…" as noting that Arrow's canonical encoding for LocalDateTime is this encoding, maybe something like "Arrow represents LocalDateTime as a Timestamp as follows…"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz closed pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
bkietz closed pull request #10629:
URL: https://github.com/apache/arrow/pull/10629


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#discussion_r661591173



##########
File path: format/Schema.fbs
##########
@@ -218,8 +218,34 @@ table Time {
 /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
 /// leap seconds.
 ///
-/// The Timestamp metadata supports both "time zone naive" and "time zone
-/// aware" timestamps. Read about the timezone attribute for more detail
+/// Date & time libraries often have multiple different data types for temporal
+/// data.  In order to ease interoperability between different implementations the
+/// Arrow project has some recommendations for encoding these types into a Timestamp
+/// column.
+///
+/// An "Instant" represents a single moment in time that has no meaningful time zone
+/// or the time zone is unknown.  A column of Instants can also contain values from
+/// multiple time zones.  To encode an instant set the timezone string to "UTC".
+///
+/// A "ZonedDateTime" represents a single moment in time that has a meaningful
+/// reference time zone.  To encode a ZonedDateTime as a Timestamp set the timezone
+/// string to the name of the timezone.
+///
+/// An "OffsetDateTime" represents a single moment in time combined with a meaningful
+/// offset from UTC.  To encode an OffsetDateTime as a Timestamp set the timezone string
+/// to the numeric time zone offset string (e.g. "+03:00").
+///
+/// A "LocalDateTime" does not represent a single moment in time.  It represents a wall
+/// clock time combined with a date.  Because of daylight savings time there may multiple
+/// Instants that correspond to a single LocalDateTime in any given time zone.  A
+/// LocalDateTime is often stored as a struct or a Date32/Time64 pair.  However, it can
+/// also be encoded into a Timestamp column.  To do so the value should be the the time
+/// elapsed from the Unix epoch so that a wall clock in UTC would display the desired time.
+/// Futhermore, the timezone string should be set to null.
+///
+/// In ALL cases the values of the Timestamp column should be the time elapsed from the

Review comment:
       Maybe I should word it as "In ALL cases the values of the Timestamp column should be **a** time elapsed from the Unix epoch."
   
   I very much intend to include LocalDateTime in this sentence because it is being encoded as a UTC offset.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] adamhooper commented on a change in pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
adamhooper commented on a change in pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#discussion_r661536702



##########
File path: format/Schema.fbs
##########
@@ -218,8 +218,34 @@ table Time {
 /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
 /// leap seconds.
 ///
-/// The Timestamp metadata supports both "time zone naive" and "time zone
-/// aware" timestamps. Read about the timezone attribute for more detail
+/// Date & time libraries often have multiple different data types for temporal
+/// data.  In order to ease interoperability between different implementations the
+/// Arrow project has some recommendations for encoding these types into a Timestamp
+/// column.
+///
+/// An "Instant" represents a single moment in time that has no meaningful time zone
+/// or the time zone is unknown.  A column of Instants can also contain values from
+/// multiple time zones.  To encode an instant set the timezone string to "UTC".
+///
+/// A "ZonedDateTime" represents a single moment in time that has a meaningful
+/// reference time zone.  To encode a ZonedDateTime as a Timestamp set the timezone
+/// string to the name of the timezone.
+///
+/// An "OffsetDateTime" represents a single moment in time combined with a meaningful
+/// offset from UTC.  To encode an OffsetDateTime as a Timestamp set the timezone string
+/// to the numeric time zone offset string (e.g. "+03:00").
+///
+/// A "LocalDateTime" does not represent a single moment in time.  It represents a wall
+/// clock time combined with a date.  Because of daylight savings time there may multiple
+/// Instants that correspond to a single LocalDateTime in any given time zone.  A
+/// LocalDateTime is often stored as a struct or a Date32/Time64 pair.  However, it can
+/// also be encoded into a Timestamp column.  To do so the value should be the the time
+/// elapsed from the Unix epoch so that a wall clock in UTC would display the desired time.
+/// Futhermore, the timezone string should be set to null.
+///
+/// In ALL cases the values of the Timestamp column should be the time elapsed from the

Review comment:
       All cases _except_ "LocalDateTime". (Maybe this paragraph should be removed, and extra wording should appear in the "ZonedDateTime" section instead?)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#discussion_r662681501



##########
File path: format/Schema.fbs
##########
@@ -218,8 +218,34 @@ table Time {
 /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
 /// leap seconds.
 ///
-/// The Timestamp metadata supports both "time zone naive" and "time zone
-/// aware" timestamps. Read about the timezone attribute for more detail
+/// Date & time libraries often have multiple different data types for temporal
+/// data.  In order to ease interoperability between different implementations the
+/// Arrow project has some recommendations for encoding these types into a Timestamp
+/// column.
+///
+/// An "Instant" represents a single moment in time that has no meaningful time zone
+/// or the time zone is unknown.  A column of Instants can also contain values from
+/// multiple time zones.  To encode an instant set the timezone string to "UTC".

Review comment:
       I put a statement in there pointing out the potential ambiguity.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#discussion_r662396553



##########
File path: format/Schema.fbs
##########
@@ -218,8 +218,34 @@ table Time {
 /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
 /// leap seconds.
 ///
-/// The Timestamp metadata supports both "time zone naive" and "time zone
-/// aware" timestamps. Read about the timezone attribute for more detail
+/// Date & time libraries often have multiple different data types for temporal
+/// data.  In order to ease interoperability between different implementations the
+/// Arrow project has some recommendations for encoding these types into a Timestamp
+/// column.
+///
+/// An "Instant" represents a single moment in time that has no meaningful time zone
+/// or the time zone is unknown.  A column of Instants can also contain values from
+/// multiple time zones.  To encode an instant set the timezone string to "UTC".

Review comment:
       we might want to clarify that this encoding loses details that the value was in fact an instant.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#discussion_r665229764



##########
File path: format/Schema.fbs
##########
@@ -218,8 +218,33 @@ table Time {
 /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
 /// leap seconds.
 ///
-/// The Timestamp metadata supports both "time zone naive" and "time zone
-/// aware" timestamps. Read about the timezone attribute for more detail
+/// Date & time libraries often have multiple different data types for temporal
+/// data.  In order to ease interoperability between different implementations the
+/// Arrow project has some recommendations for encoding these types into a Timestamp
+/// column.
+///
+/// An "Instant" represents a single moment in time that has no meaningful time zone
+/// or the time zone is unknown.  A column of Instants can also contain values from
+/// multiple time zones.  To encode an instant set the timezone string to "UTC".
+///
+/// A "ZonedDateTime" represents a single moment in time that has a meaningful

Review comment:
       Thank you for the suggestion.  I've switched to natural names.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#discussion_r662681591



##########
File path: format/Schema.fbs
##########
@@ -218,8 +218,34 @@ table Time {
 /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
 /// leap seconds.
 ///
-/// The Timestamp metadata supports both "time zone naive" and "time zone
-/// aware" timestamps. Read about the timezone attribute for more detail
+/// Date & time libraries often have multiple different data types for temporal
+/// data.  In order to ease interoperability between different implementations the
+/// Arrow project has some recommendations for encoding these types into a Timestamp
+/// column.
+///
+/// An "Instant" represents a single moment in time that has no meaningful time zone
+/// or the time zone is unknown.  A column of Instants can also contain values from
+/// multiple time zones.  To encode an instant set the timezone string to "UTC".
+///
+/// A "ZonedDateTime" represents a single moment in time that has a meaningful
+/// reference time zone.  To encode a ZonedDateTime as a Timestamp set the timezone
+/// string to the name of the timezone.
+///
+/// An "OffsetDateTime" represents a single moment in time combined with a meaningful
+/// offset from UTC.  To encode an OffsetDateTime as a Timestamp set the timezone string
+/// to the numeric time zone offset string (e.g. "+03:00").
+///
+/// A "LocalDateTime" does not represent a single moment in time.  It represents a wall
+/// clock time combined with a date.  Because of daylight savings time there may multiple
+/// Instants that correspond to a single LocalDateTime in any given time zone.  A
+/// LocalDateTime is often stored as a struct or a Date32/Time64 pair.  However, it can
+/// also be encoded into a Timestamp column.  To do so the value should be the the time
+/// elapsed from the Unix epoch so that a wall clock in UTC would display the desired time.
+/// Futhermore, the timezone string should be set to null.
+///
+/// In ALL cases the values of the Timestamp column should be the time elapsed from the

Review comment:
       I agree the sentence was redundant.  I went ahead and removed it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#discussion_r661592347



##########
File path: format/Schema.fbs
##########
@@ -218,8 +218,34 @@ table Time {
 /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
 /// leap seconds.
 ///
-/// The Timestamp metadata supports both "time zone naive" and "time zone
-/// aware" timestamps. Read about the timezone attribute for more detail
+/// Date & time libraries often have multiple different data types for temporal
+/// data.  In order to ease interoperability between different implementations the
+/// Arrow project has some recommendations for encoding these types into a Timestamp
+/// column.
+///
+/// An "Instant" represents a single moment in time that has no meaningful time zone
+/// or the time zone is unknown.  A column of Instants can also contain values from
+/// multiple time zones.  To encode an instant set the timezone string to "UTC".
+///
+/// A "ZonedDateTime" represents a single moment in time that has a meaningful
+/// reference time zone.  To encode a ZonedDateTime as a Timestamp set the timezone
+/// string to the name of the timezone.
+///
+/// An "OffsetDateTime" represents a single moment in time combined with a meaningful
+/// offset from UTC.  To encode an OffsetDateTime as a Timestamp set the timezone string
+/// to the numeric time zone offset string (e.g. "+03:00").
+///
+/// A "LocalDateTime" does not represent a single moment in time.  It represents a wall
+/// clock time combined with a date.  Because of daylight savings time there may multiple
+/// Instants that correspond to a single LocalDateTime in any given time zone.  A
+/// LocalDateTime is often stored as a struct or a Date32/Time64 pair.  However, it can
+/// also be encoded into a Timestamp column.  To do so the value should be the the time
+/// elapsed from the Unix epoch so that a wall clock in UTC would display the desired time.
+/// Futhermore, the timezone string should be set to null.
+///
+/// In ALL cases the values of the Timestamp column should be the time elapsed from the

Review comment:
       Ok, I changed the wording to "in ALL cases the values of the Timestamp column should be an offset from the Unix epoch".




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#discussion_r664130256



##########
File path: format/Schema.fbs
##########
@@ -218,8 +218,33 @@ table Time {
 /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
 /// leap seconds.
 ///
-/// The Timestamp metadata supports both "time zone naive" and "time zone
-/// aware" timestamps. Read about the timezone attribute for more detail
+/// Date & time libraries often have multiple different data types for temporal
+/// data.  In order to ease interoperability between different implementations the
+/// Arrow project has some recommendations for encoding these types into a Timestamp
+/// column.
+///
+/// An "Instant" represents a single moment in time that has no meaningful time zone
+/// or the time zone is unknown.  A column of Instants can also contain values from
+/// multiple time zones.  To encode an instant set the timezone string to "UTC".
+///
+/// A "ZonedDateTime" represents a single moment in time that has a meaningful

Review comment:
       I would personally not use CamelClass names for those concepts. That way, it seems to indicate this is a specific arrow type or class (since this is the arrow specification), rather than a general description, which could cause confusion (and it are not generally used names, AFAIK only Java uses exactly those names). 
   
   Describing things in a "neutral" way is hard of course (since everybody comes to this with their own background with specific terminology), but so to avoid bike-shedding on the terms, I would suggest to just not use CameCase. So for example, alternatively could use "zoned date-time" (with a space, so it doesn't mimic a class name).

##########
File path: format/Schema.fbs
##########
@@ -218,8 +218,33 @@ table Time {
 /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
 /// leap seconds.
 ///
-/// The Timestamp metadata supports both "time zone naive" and "time zone
-/// aware" timestamps. Read about the timezone attribute for more detail
+/// Date & time libraries often have multiple different data types for temporal
+/// data.  In order to ease interoperability between different implementations the
+/// Arrow project has some recommendations for encoding these types into a Timestamp
+/// column.
+///
+/// An "Instant" represents a single moment in time that has no meaningful time zone
+/// or the time zone is unknown.  A column of Instants can also contain values from
+/// multiple time zones.  To encode an instant set the timezone string to "UTC".
+///
+/// A "ZonedDateTime" represents a single moment in time that has a meaningful

Review comment:
       I would personally not use CamelClass names for those concepts. That way, it seems to indicate this is a specific Arrow type or class (since this is the Arrow specification), rather than a general description of concepts, which could cause confusion (and it are not generally used names, AFAIK only Java uses exactly those names). 
   
   Describing things in a "neutral" way is hard of course (since everybody comes to this with their own background with specific terminology), but so to avoid bike-shedding on the terms, I would suggest to just not use CameCase. So for example, alternatively could use "zoned date-time" (with a space, so it doesn't mimic a class name).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] adamhooper commented on a change in pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
adamhooper commented on a change in pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#discussion_r661606855



##########
File path: format/Schema.fbs
##########
@@ -218,8 +218,34 @@ table Time {
 /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
 /// leap seconds.
 ///
-/// The Timestamp metadata supports both "time zone naive" and "time zone
-/// aware" timestamps. Read about the timezone attribute for more detail
+/// Date & time libraries often have multiple different data types for temporal
+/// data.  In order to ease interoperability between different implementations the
+/// Arrow project has some recommendations for encoding these types into a Timestamp
+/// column.
+///
+/// An "Instant" represents a single moment in time that has no meaningful time zone
+/// or the time zone is unknown.  A column of Instants can also contain values from
+/// multiple time zones.  To encode an instant set the timezone string to "UTC".
+///
+/// A "ZonedDateTime" represents a single moment in time that has a meaningful
+/// reference time zone.  To encode a ZonedDateTime as a Timestamp set the timezone
+/// string to the name of the timezone.
+///
+/// An "OffsetDateTime" represents a single moment in time combined with a meaningful
+/// offset from UTC.  To encode an OffsetDateTime as a Timestamp set the timezone string
+/// to the numeric time zone offset string (e.g. "+03:00").
+///
+/// A "LocalDateTime" does not represent a single moment in time.  It represents a wall
+/// clock time combined with a date.  Because of daylight savings time there may multiple
+/// Instants that correspond to a single LocalDateTime in any given time zone.  A
+/// LocalDateTime is often stored as a struct or a Date32/Time64 pair.  However, it can
+/// also be encoded into a Timestamp column.  To do so the value should be the the time
+/// elapsed from the Unix epoch so that a wall clock in UTC would display the desired time.
+/// Futhermore, the timezone string should be set to null.
+///
+/// In ALL cases the values of the Timestamp column should be the time elapsed from the

Review comment:
       It sounds like this sentence is trying to clarify, but I think _removing_ it might clarify more.
   
   Looking at two subsequent paragraphs in the spec:
   
   * "LocalDateTime": "... value should be the the time elapsed from the Unix epoch so that a wall clock in UTC would display the desired time."
   * "ALL": "value should be the time elapsed from the Unix epoch".
   
   The "ALL" sentence is grammatically correct -- yes, both descriptions include the literal phrase, "time elapsed from the Unix epoch." But as a spec reader, when I read a spec that has two strikingly similar sentences so close together, I ask myself: "is this clarification saying the same thing, or is it saying the opposite thing?" It's easy to interpret this "ALL" sentence as meaning the opposite thing ... until I go back and re-read three times and feel slightly annoyed.
   
   I was going to suggest moving this sentence to the top of the definition, because that would be simpler to understand. But ... it's already there.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#issuecomment-871252141


   https://issues.apache.org/jira/browse/ARROW-13218


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#discussion_r661591843



##########
File path: format/Schema.fbs
##########
@@ -218,8 +218,34 @@ table Time {
 /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
 /// leap seconds.
 ///
-/// The Timestamp metadata supports both "time zone naive" and "time zone
-/// aware" timestamps. Read about the timezone attribute for more detail
+/// Date & time libraries often have multiple different data types for temporal
+/// data.  In order to ease interoperability between different implementations the
+/// Arrow project has some recommendations for encoding these types into a Timestamp
+/// column.
+///
+/// An "Instant" represents a single moment in time that has no meaningful time zone
+/// or the time zone is unknown.  A column of Instants can also contain values from
+/// multiple time zones.  To encode an instant set the timezone string to "UTC".
+///
+/// A "ZonedDateTime" represents a single moment in time that has a meaningful
+/// reference time zone.  To encode a ZonedDateTime as a Timestamp set the timezone
+/// string to the name of the timezone.
+///
+/// An "OffsetDateTime" represents a single moment in time combined with a meaningful
+/// offset from UTC.  To encode an OffsetDateTime as a Timestamp set the timezone string
+/// to the numeric time zone offset string (e.g. "+03:00").
+///
+/// A "LocalDateTime" does not represent a single moment in time.  It represents a wall
+/// clock time combined with a date.  Because of daylight savings time there may multiple
+/// Instants that correspond to a single LocalDateTime in any given time zone.  A
+/// LocalDateTime is often stored as a struct or a Date32/Time64 pair.  However, it can

Review comment:
       Ok - I think I understand now, thanks for clarifying.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] adamhooper commented on a change in pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
adamhooper commented on a change in pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#discussion_r661536702



##########
File path: format/Schema.fbs
##########
@@ -218,8 +218,34 @@ table Time {
 /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
 /// leap seconds.
 ///
-/// The Timestamp metadata supports both "time zone naive" and "time zone
-/// aware" timestamps. Read about the timezone attribute for more detail
+/// Date & time libraries often have multiple different data types for temporal
+/// data.  In order to ease interoperability between different implementations the
+/// Arrow project has some recommendations for encoding these types into a Timestamp
+/// column.
+///
+/// An "Instant" represents a single moment in time that has no meaningful time zone
+/// or the time zone is unknown.  A column of Instants can also contain values from
+/// multiple time zones.  To encode an instant set the timezone string to "UTC".
+///
+/// A "ZonedDateTime" represents a single moment in time that has a meaningful
+/// reference time zone.  To encode a ZonedDateTime as a Timestamp set the timezone
+/// string to the name of the timezone.
+///
+/// An "OffsetDateTime" represents a single moment in time combined with a meaningful
+/// offset from UTC.  To encode an OffsetDateTime as a Timestamp set the timezone string
+/// to the numeric time zone offset string (e.g. "+03:00").
+///
+/// A "LocalDateTime" does not represent a single moment in time.  It represents a wall
+/// clock time combined with a date.  Because of daylight savings time there may multiple
+/// Instants that correspond to a single LocalDateTime in any given time zone.  A
+/// LocalDateTime is often stored as a struct or a Date32/Time64 pair.  However, it can
+/// also be encoded into a Timestamp column.  To do so the value should be the the time
+/// elapsed from the Unix epoch so that a wall clock in UTC would display the desired time.
+/// Futhermore, the timezone string should be set to null.
+///
+/// In ALL cases the values of the Timestamp column should be the time elapsed from the

Review comment:
       All cases _except_ "LocalDateTime".




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#discussion_r661589118



##########
File path: format/Schema.fbs
##########
@@ -218,8 +218,34 @@ table Time {
 /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
 /// leap seconds.
 ///
-/// The Timestamp metadata supports both "time zone naive" and "time zone
-/// aware" timestamps. Read about the timezone attribute for more detail
+/// Date & time libraries often have multiple different data types for temporal
+/// data.  In order to ease interoperability between different implementations the
+/// Arrow project has some recommendations for encoding these types into a Timestamp
+/// column.
+///
+/// An "Instant" represents a single moment in time that has no meaningful time zone
+/// or the time zone is unknown.  A column of Instants can also contain values from
+/// multiple time zones.  To encode an instant set the timezone string to "UTC".
+///
+/// A "ZonedDateTime" represents a single moment in time that has a meaningful
+/// reference time zone.  To encode a ZonedDateTime as a Timestamp set the timezone
+/// string to the name of the timezone.
+///
+/// An "OffsetDateTime" represents a single moment in time combined with a meaningful
+/// offset from UTC.  To encode an OffsetDateTime as a Timestamp set the timezone string
+/// to the numeric time zone offset string (e.g. "+03:00").
+///
+/// A "LocalDateTime" does not represent a single moment in time.  It represents a wall
+/// clock time combined with a date.  Because of daylight savings time there may multiple
+/// Instants that correspond to a single LocalDateTime in any given time zone.  A
+/// LocalDateTime is often stored as a struct or a Date32/Time64 pair.  However, it can

Review comment:
       I'm not sure I agree.  I think an application is free to store a LocalDateTime in whatever form suits them.  For example, if the application is planning on doing mostly field extraction & local time display it may make sense to store in a larger struct form to save time on division.
   
   I don't think the statement "If you have a LocalDateTime you can expect it to be encoded as a Timestamp column" is true.
   
   I do think the statement "If you have a Timestamp column with no time zone you can expect it to be a LocalDateTime" is true.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#discussion_r662681338



##########
File path: format/Schema.fbs
##########
@@ -218,8 +218,34 @@ table Time {
 /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
 /// leap seconds.
 ///
-/// The Timestamp metadata supports both "time zone naive" and "time zone
-/// aware" timestamps. Read about the timezone attribute for more detail
+/// Date & time libraries often have multiple different data types for temporal
+/// data.  In order to ease interoperability between different implementations the
+/// Arrow project has some recommendations for encoding these types into a Timestamp
+/// column.
+///
+/// An "Instant" represents a single moment in time that has no meaningful time zone
+/// or the time zone is unknown.  A column of Instants can also contain values from
+/// multiple time zones.  To encode an instant set the timezone string to "UTC".
+///
+/// A "ZonedDateTime" represents a single moment in time that has a meaningful
+/// reference time zone.  To encode a ZonedDateTime as a Timestamp set the timezone
+/// string to the name of the timezone.
+///
+/// An "OffsetDateTime" represents a single moment in time combined with a meaningful
+/// offset from UTC.  To encode an OffsetDateTime as a Timestamp set the timezone string
+/// to the numeric time zone offset string (e.g. "+03:00").
+///
+/// A "LocalDateTime" does not represent a single moment in time.  It represents a wall
+/// clock time combined with a date.  Because of daylight savings time there may multiple
+/// Instants that correspond to a single LocalDateTime in any given time zone.  A
+/// LocalDateTime is often stored as a struct or a Date32/Time64 pair.  However, it can
+/// also be encoded into a Timestamp column.  To do so the value should be the the time
+/// elapsed from the Unix epoch so that a wall clock in UTC would display the desired time.
+/// Futhermore, the timezone string should be set to null.

Review comment:
       I changed it to mention empty string as well.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on a change in pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#discussion_r662681743



##########
File path: format/Schema.fbs
##########
@@ -218,8 +218,34 @@ table Time {
 /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
 /// leap seconds.
 ///
-/// The Timestamp metadata supports both "time zone naive" and "time zone
-/// aware" timestamps. Read about the timezone attribute for more detail
+/// Date & time libraries often have multiple different data types for temporal
+/// data.  In order to ease interoperability between different implementations the
+/// Arrow project has some recommendations for encoding these types into a Timestamp
+/// column.
+///
+/// An "Instant" represents a single moment in time that has no meaningful time zone
+/// or the time zone is unknown.  A column of Instants can also contain values from
+/// multiple time zones.  To encode an instant set the timezone string to "UTC".

Review comment:
       It's on line 232 at the moment.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] adamhooper commented on a change in pull request #10629: ARROW-13218: [Doc] Document/clarify conventions for timestamp storage

Posted by GitBox <gi...@apache.org>.
adamhooper commented on a change in pull request #10629:
URL: https://github.com/apache/arrow/pull/10629#discussion_r661606855



##########
File path: format/Schema.fbs
##########
@@ -218,8 +218,34 @@ table Time {
 /// leap seconds, as a 64-bit integer. Note that UNIX time does not include
 /// leap seconds.
 ///
-/// The Timestamp metadata supports both "time zone naive" and "time zone
-/// aware" timestamps. Read about the timezone attribute for more detail
+/// Date & time libraries often have multiple different data types for temporal
+/// data.  In order to ease interoperability between different implementations the
+/// Arrow project has some recommendations for encoding these types into a Timestamp
+/// column.
+///
+/// An "Instant" represents a single moment in time that has no meaningful time zone
+/// or the time zone is unknown.  A column of Instants can also contain values from
+/// multiple time zones.  To encode an instant set the timezone string to "UTC".
+///
+/// A "ZonedDateTime" represents a single moment in time that has a meaningful
+/// reference time zone.  To encode a ZonedDateTime as a Timestamp set the timezone
+/// string to the name of the timezone.
+///
+/// An "OffsetDateTime" represents a single moment in time combined with a meaningful
+/// offset from UTC.  To encode an OffsetDateTime as a Timestamp set the timezone string
+/// to the numeric time zone offset string (e.g. "+03:00").
+///
+/// A "LocalDateTime" does not represent a single moment in time.  It represents a wall
+/// clock time combined with a date.  Because of daylight savings time there may multiple
+/// Instants that correspond to a single LocalDateTime in any given time zone.  A
+/// LocalDateTime is often stored as a struct or a Date32/Time64 pair.  However, it can
+/// also be encoded into a Timestamp column.  To do so the value should be the the time
+/// elapsed from the Unix epoch so that a wall clock in UTC would display the desired time.
+/// Futhermore, the timezone string should be set to null.
+///
+/// In ALL cases the values of the Timestamp column should be the time elapsed from the

Review comment:
       It sounds like this sentence is trying to clarify, but I think _removing_ it might clarify more.
   
   Looking at two subsequent paragraphs in the spec:
   
   * "LocalDateTime": "... value should be the the time elapsed from the Unix epoch so that a wall clock in UTC would display the desired time."
   * "ALL": "value should be the time elapsed from the Unix epoch".
   
   The "ALL" sentence is grammatically correct -- yes, both descriptions include the literal phrase, "time elapsed from the Unix epoch." But as a spec reader, when I read a spec that has two strikingly similar sentences so close together, I ask myself: "is this clarification saying the same thing, or is it saying the opposite thing?" It's easy to interpret this "ALL" sentence as meaning the opposite thing ... until I go back and re-read three times and feel slightly annoyed.
   
   I was going to suggest moving this sentence to the top of the definition, because that would be simpler to understand. But ... it's already there:
   
   ```
   /// Time elapsed from the Unix epoch, 00:00:00.000 on 1 January 1970, excluding
   /// leap seconds, as a 64-bit integer.
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org