You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/05 22:34:41 UTC

[GitHub] [arrow-datafusion] velvia opened a new issue #686: Specific timezone support for `to_timetamp*()`

velvia opened a new issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686


   **Describe the bug**
   I'm not sure I'd call it a "bug" per se, but all of the `to_timestamp*()` functions output `TimestampTypes` in Arrow that have a Timezone field of None.      The issues here are:
   - There is no way to specify a specific timezone currently to convert to
   - Timestamp types with different timezones are not comparable / compatible types.  For example, if I have data with a timezone of `Some("UTC")` the following fails due to incompatible types: `WHERE timestamp_col > to_timestamp("2021-06-21T12:00Z")` (because timestamp_col has Some(UTC) but to_timestamp returns None)
   
   I'm aware there was a recent vote to treat the None timestamp like a local timestamp, but this isn't always the case either.  For example, we have been using the output of None timestamps from DataFusion but we treat all our timestamps as UTC internally.
   
   **To Reproduce**
   Steps to reproduce the behavior:
   I can create a PR to have a test to show the above comparison failing.  Take some data with a timezone with UTC and do the above comparison.
   
   **Expected behavior**
   The best solution I can think of would be for `to_timestamp(...)` to support a second, optional argument where the timezone can be specified.  This allows for casting/conversion and generation of timestamps with a specific timezone requirement.  There are some technical challenges which can be overcome:
   - Current code in datetime_expressions.rs for example uses macros which assume a static, const type, which makes it difficult to input variable timezone args
   - The function signature checks would be really complicated
   
   There are other "hacks" which are possible:
   - For our use case, `Some("UTC")` is equivalent to `None`, so we could make Some("UTC") cast to None and vice versa, but this would lose precision for many people.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] velvia commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
velvia commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-881755559


   @alamb @westonpace and others: just to continue this conversation and try and bring it to a close.
   
   1) It sounds like the Arrow/Rust community would like to transition to supporting `Timestamp(_, UTC)` as the standard for UTC-based processing, rather than `Timestamp(_, None)`.  
   2) Since some people already use the existing `Timestamp(_, None)` to really mean `Timestamp(_, UTC)` (including us in some cases), and there is also data (in Parquet) that has at least UTC timezone info, we should provide interop to convert between the two, or at least cast things between.
   3) A `timezone()` function or `with_timezone()` or `at_timezone()` function would be clearer than adding a second argument to `to_timestamp(...)`
   
   Do the above sound about right?
   
   I would also like to observe current obstacles to 1) and 2) in the Arrow/DF codebase (there are more, just what I've found in the last ~2 weeks):
   - `ScalarValue` in DataFusion only supports Timestamp types with None timezone
   - Coercion of Timestamp(_, UTC) to Timestamp(_, None) (or the other way around) is potentially wrong and also has some schema type check issues
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] adamhooper commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
adamhooper commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884485767


   > to_timestamp('1970-01-01', 'UTC')	`Timestamp(Nanoseconds, Utc)`	5*3600000000000 (5 hour of nanoseconds) **assuming code was run on a computer set to EDT, UTC-5:00**
   
   Oh, I see -- this is existing behavior. But then, does existing behavior define a different result for row 1? https://github.com/apache/arrow-datafusion/blob/5900b4c6829b0bdbe69e1f95fb74e935bc8f33d4/datafusion/src/physical_plan/datetime_expressions.rs#L85
   
   In my view, if the existing behavior has to stay, it has to stay; but if it _doesn't_ have to stay, then as a user I'd feel much more comfortable if DataFusion didn't expose the server's timezone -- or at least made it a session variable, as Postgres does.
   
   Note that we're up to _four_ timezones in every call to `TO_TIMESTAMP()`:
   
   * `NULL`: "localtime" the Arrow concept -- calendar-date-plus-wall-time
   * `UTC`
   * The input-string timezone offset
   * DataFusion server's "local timezone" (distinct from Arrow's "local time", and presumably unknown to my users)
   
   My use case is to run custom SQL from the general public. I realize I can use environment variables to force DataFusion's "server timezone" to be UTC; but that doesn't negate the learning curve.
   
   Postgres has the same complexity. Its syntax:
   
   * `SELECT timestamp_with_time_zone AT TIME ZONE 'America/Eastern'` => converts UTC to localtime
   * `SELECT timestamp_with_time_zone::TIMESTAMP` => converts UTC to session-timezone localtime
   * `SELECT timestamp_without_time_zone AT TIME ZONE 'America/Eastern' => converts localtime to UTC
   * `SELECT timestamp_without_time_zone::TIMESTAMPTZ` => converts localtime to UTC using session-timezone localtime
   
   ... and I'll bet most Postgres misunderstand these concepts for years.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] rok edited a comment on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
rok edited a comment on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884535895


   Agreed with @adamhooper - as a user I would strongly dislike the row 4 behavior as it would make my logic a function of the environment I'd run it on instead of only a function of the input.
   
   Keep in mind that timezones can have nonexistent and ambiguous times that need to be handled when "localizing" (converting wall time) to UTC. In arrow there's an [open PR](https://github.com/apache/arrow/pull/10610/files#diff-19248dfd2de25c17622290aae9c78024ba474de2913e44bc8cf0dce0bfe7dcda) for a `tz_localize` kernel.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-885171173


   ```
   TO_TIMESTAMP_UTC
   TO_TIMESTAMP_LOCALTIME
   ```
   
   Yes, this is what I thought @velvia  was more or less proposing, but instead of those names, use a second optional argument
   
   ```
   to_timestamp(.., 'UTC') --> TO_TIMESTAMP_UTC(..)
   to_timestamp(..) --> TO_TIMESTAMP_LOCALTIME(..)
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] velvia commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
velvia commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-874356725


   @alamb @westonpace .... your thoughts on this matter would be really welcome. :)   
   
   To be specific, if you do `to_timestamp(date_string)` then it can figure out the timezone from the string, sometimes.  However, current architecture of Arrow/DF requires a specific timezone output type.
   If you did `to_timestamp(int64)` then you would not have any timezone info at all, and would require the input in the second argument no matter what for completeness.
   
   There is also the possibility to simplify things and eliminate the timezone field from Arrow, but that's a really big change.  For example, Parquet does not store the timezone but simply has a isAdjustedToUTC field.  If True, that means timestamps are UTC, and if not that means local timestamp.  
   https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] rok edited a comment on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
rok edited a comment on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884535895


   Agreed with @adamhooper - as a user I would strongly dislike the row 4 behavior as it would make my logic a function of the environment I'd run it on instead of only a function the input.
   
   Keep in mind that timezones can have nonexistent and ambiguous times that need to be handled when "localizing" (converting wall time) to UTC. In arrow there's an [open PR](https://github.com/apache/arrow/pull/10610/files#diff-19248dfd2de25c17622290aae9c78024ba474de2913e44bc8cf0dce0bfe7dcda) for a `tz_localize` kernel.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-874907741


   @velvia 
   
   > we have been using the output of None timestamps from DataFusion but we treat all our timestamps as UTC internally.
   
   FWIW this is what we do in IOx as well
   
   > The best solution I can think of would be for to_timestamp(...) to support a second, optional argument where the timezone can be specified. 
   
   I think a second optional argument for `to_timestamp` is a fine solution that would be backwards compatible and also allow for improvements going forward
   
   > For example, if I have data with a timezone of Some("UTC") the following fails due to incompatible types: WHERE timestamp_col > to_timestamp("2021-06-21T12:00Z") (because timestamp_col has Some(UTC) but to_timestamp returns None)
   
   I think a better solution would be for DataFusion to implement coercion rules between date/time/timestamp types. In your example of a timestamp_col with `Timestamp(Milliseconds, Some("UTC")`), then I would *like* DataFusion to handle
   
   ```
   WHERE timestamp_col > to_timestamp("2021-06-21T12:00Z")
   ```
   
   by adding the following cast (please forgive me the pseudo-SQL)
   
   ```
   WHERE timestamp_col > cast (to_timestamp("2021-06-21T12:00Z") as timestamp(UTC))
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] adamhooper commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
adamhooper commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-875815801


   > > The best solution I can think of would be for to_timestamp(...) to support a second, optional argument where the timezone can be specified.
   > 
   > I think a second optional argument for `to_timestamp` is a fine solution that would be backwards compatible and also allow for improvements going forward
   
   I'm not even a user (yet); I see two big great reasons not to add an argument:
   
   *1. Ambiguous calling pattern*. When converting string to Arrow timestamp, there are two timezones to consider: the input-string timezone and the output-timestamp timezone. Which one would this argument mean?
   
   If I see the call, `TO_TIMESTAMP('2021-01-01T00:00:00', 'America/Montreal')`, then only result I can imagine as "useful" is `2021-01-01T05:00Z` -- that is, "the input is from my clock." (@westonpace, I think we're aligned here.)
   
   Someone else might expect `2020-12-31T19:00-0500`, saying, "I have UTC input and I want Montreal wall-clock time." A third person might expect `2021-01-01Z` with timezone-metadata `America/Montreal`.
   
   If three smart people reading the same line of code would interpret it three different ways, that's a design problem.
   
   *2. PostgreSQL compatibility*. Postgres has two timestamp types; Arrow timestamp+tz-metadata columns are neither.
   
   In Postgres, `TO_TIMESTAMP()` creates a `TIMESTAMP WITH TIME ZONE` -- that is, a moment on our shared, global timeline, stored as 64-bit integer since the UNIX epoch. It has no timezone: each moment happens once everywhere in the world, regardless of timezone.
   
   Postgres also has a `TIMESTAMP WITHOUT TIME ZONE` type -- that is, a calendar-date-plus-wall-clock type ... but Postgres' own `TO_TIMESTAMP()` doesn't output it. That's on purpose. I think most people using `TIMESTAMP WITHOUT TIME ZONE` are using it in error, misled by the name. It represents the very opposite of what it sounds like. In Postgres, it's handy for calendaring apps and not much else. (There are other uses in Pandas/R.)
   
   Postgres has gone 30 years without Arrow's timestamp+tz-metadata types.
   
   In the words of Tom Lane, ["if we simply took away to_timestamp(), most users would be better off"](https://www.postgresql.org/message-id/1264125.1623208850%40sss.pgh.pa.us).
   
   Be careful about adding a feature: removing one might help more users accomplish more tasks, more smoothly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-875543176


   > Should the current output of to_timestamp(...) default to Timestamp(_, Some(UTC)) ? This sounds like a yes (and would solve my use case)
   
   While this might be theoretically the best option, I think this would be a fairly painful change (at least for us in IOx) as our timestamps are `Timestamp(Nanoseconds, None)` 
   
   > to_timestamp(....) can be extended to add a second argument to accept different time zones, or None (not sure how to specify None if the default is UTC, per item 1)
   
   Perhaps `to_timestamp('....', NULL)`  could work
   
   > Or, simplify everything and have the timestamp types not have timezone, but just if they are UTC or local (similar to Parquet).
   
   I am actually a fan of this direction -- I am fairly sure the use of a Timezone that is something other than `None` will not work well in Arrow / DataFusion (Because I tried using `Timestamp(.., Some("UTC")` in IOx and basic query stuff didn't work
   
   I think it would keep things simpler to convert timestamps on the edge (aka if it was "local time" convert to UTC prior to putting into DataFusion) and then have all the processing machinery deal with UTC timestamps only
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884524231


   > Oh, I see -- this is existing behavior. But then, does existing behavior define a different result for row 1?
   
   on master, if you execute `to_timestamp('1970-01-01')` on a server running `DataFusion` in a UTC-5 timezone, you get `5*3600000000000`
   
   > In my view, if the existing behavior has to stay, it has to stay; 
   
   I don't think it has to say
   
   I think the choice of where to translate a local time into one with a specific timezone is probably best left to a higher level than DataFusion. DataFusion isn't a database system -- it does not have the notion of a client or server. It is more like a tool kit with which to build such things.
   
   > Note that we're up to four timezones in every call to TO_TIMESTAMP():
   
   I think we have three timezones that are important:
   1. The input string timezone offset (may not exist)
   2. The local timezone on which datafusion is run (used in the case the user does not supply a timezone)
   2. The target timezone of the arrow type
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] adamhooper commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
adamhooper commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-876681175


   @velvia Great points
   
   My last two suggestions were exactly about interop -- and a transition period. Today, DataFusion users interpret timezone=null to mean `TIMESTAMP WITH TIMEZONE`; but eventually, DataFusion must consider timezone=null to mean `TIMESTAMP WITHOUT TIMEZONE`, right? Users will need to change what they're doing; the transition path will be hard and it depends on the DataFusion community.
   
   I hadn't thought of timestamp resolution. Again, I'm out of my element (for now) :).
   
   The crux of my suggestion is to ignore the `timezone` metadata field (treat it as a boolean, `timezone=null` or `timezone=UTC`). That translates Parquet <=> Arrow <=> PostgreSQL cleanly. Treating it like a boolean keeps the feature list small and saves people from confusion.
   
   As for the `TO_TIMESTAMP()` parameter: Postgres has a different function might do exactly what you're suggesting. [`AT TIME ZONE`](https://www.postgresql.org/docs/13/functions-datetime.html#FUNCTIONS-DATETIME-ZONECONVERT) _toggles_ the "`WITH TIMEZONE`-ness" of a timestamp. It's also callable as a function, `TIMEZONE(zone, timestamp)`.
   
   But -- sidetracking here -- how important are these types? `TIMESTAMP WITH TIME ZONE` is clearly mission-critical. I think `TIMESTAMP WITHOUT TIME ZONE` has its place; but it's in a different ballpark, right. (Even Spark doesn't have `TIMESTAMP WITHOUT TIME ZONE`; and, well, does DataFusion?) Arrow's timezone metadata column is even more obscure: I've only heard of it in Pandas and R.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] westonpace edited a comment on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
westonpace edited a comment on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884452349


   Does this seem correct?
   
   
   
   Example | Output Arrow Type | Value | Output when printed
   -- | -- | -- | --
   to_timestamp('2021-07-20') | Timestamp(Nanoseconds, None) | nanos since unix epic until 2021-07-20 local time | 2021-07-20T00:00:00
   to_timestamp('2021-07-20+00:00') | Timestamp(Nanoseconds, None) | nanos since unix epic until 2021-07-20 UTC | 2021-07-20T00:00:00
   to_timestamp('2021-07-20+01:00') | Timestamp(Nanoseconds, None) | nanos since unix epic until 2021-07-20 UTC+1 | 2021-07-19T23:00:00
   to_timestamp('2021-07-20', 'UTC') | Timestamp(Nanoseconds, Utc) | nanos since unix epic until 2021-07-20 local time | 2021-07-19T00:00:00Z
   to_timestamp('2021-07-20+00:00', 'UTC') | Timestamp(Nanoseconds, Utc) | nanos since unix epic until 2021-07-20 UTC | 2021-07-20T00:00:00Z
   to_timestamp('2021-07-20+01:00', 'UTC') | Timestamp(Nanoseconds, Utc) | nanos since unix epic until 2021-07-20 UTC | 2021-07-20T23:00:00Z
   
   I think the one most interesting to me is row 4 so double check that one (also not the presence/absence of a trailing Z to indicate local time or not).
   
   By "output when printed" I mean what an implementation would get if it tried to force the value into an ISO-8601 string.
   
   Just a few extra thoughts...
   
   Does `to_timestamp` only accept offsets in the strings (e.g. `+01:00` and not tzdata strings or things like `MST`)?  For reference, ISO-8601 only deals with offsets.
   
   Will Rust ever generate an Output Arrow Type with a time zone string (e.g. MST)?
   
   My point with the last two questions is that offset conversions aren't too bad if you aren't handling tzdata strings.  Once you do then you need a timezone database which is a bit of a pain but solvable.  The C++ impl recently added compute kernels which rely on a timezone database and it's a non-trivial effort.  @rok has been spearheading this effort so you might reach out if you go down this road.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884429679


   Let's make this concrete. Here is a proposal of what I think makes sense (and I think the consistent with @velvia's suggestion for a second argument). 
   
   The second argument simply serves to define the output type; Note to really make this useful we would also have to provide some way within Arrow and DataFusion to cast back and forth between different timestamps, doing the offset conversions appropriately
   
   | Example                                           | Output Arrow Type                    |     Value
   | ------------------------------------------        | ------------------------------------ | ------------------------------------------------------
   | to_timestamp('2021-07-20')               | `Timestamp(Nanoseconds, None)`       | nanos since unix epic until 2021-07-20 local time
   | to_timestamp('2021-07-20+00:00')         | `Timestamp(Nanoseconds, None)`       | nanos since unix epic until 2021-07-20 UTC
   | to_timestamp('2021-07-20+01:00')         | `Timestamp(Nanoseconds, None)`       | nanos since unix epic until 2021-07-20 UTC+1
   | to_timestamp('2021-07-20', 'UTC')        | `Timestamp(Nanoseconds, Utc)`        | nanos since unix epic until 2021-07-20 local time
   | to_timestamp('2021-07-20+00:00', 'UTC')  | `Timestamp(Nanoseconds, Utc)`        | nanos since unix epic until 2021-07-20 UTC,
   | to_timestamp('2021-07-20+01:00', 'UTC')  | `Timestamp(Nanoseconds, Utc)`        | nanos since unix epic until 2021-07-20 UTC __NOTE only this timestamps value needs to be shifted to some other value to be relative to UTC__
   
   
   I think  @adamhooper 's point that "someone might get confused about what timezone the second argument is referring to"  at https://github.com/apache/arrow-datafusion/issues/686#issuecomment-875815801 is legitmate, but one we can handle with appropriate documentation.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-876362941


   @adamhooper thank you for the perspectives. I see the challenges with adding a second argument to `to_timestamp` and would love to consider other alternatives. 
   
   Can you suggest a strategy for the usecase described in this ticket, namely "writing a query using SQL against data stored in parquet that has an explicit timezone?"
   
   In other words, how do you suggest supporting the following query?
   
   ```sql
   SELECT ...
   FROM my_parquet_file_where_timestamp_col_has_a_specified_timezone
   WHERE timestamp_col > to_timestamp("2021-06-21T12:00Z")
   ```
   
   Which currently fails because the arrow type of `to_timestamp` has `None` as the timezone, but the type of of `timestamp_col` has a specific timezone?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] rok commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
rok commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884535895


   Agreed with @adamhooper - as I user I would strongly dislike the row 4 behavior as it would make my logic a function of the environment I'd run it on instead of only a function the input.
   
   Keep in mind that timezones can have nonexistent and ambiguous times that need to be handled when "localizing" (converting wall time) to UTC. In arrow there's an [open PR](https://github.com/apache/arrow/pull/10610/files#diff-19248dfd2de25c17622290aae9c78024ba474de2913e44bc8cf0dce0bfe7dcda) for a `tz_localize` kernel.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] adamhooper commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
adamhooper commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-876420215


   @alamb would anybody be horrified/appalled at the following:
   
   * `to_timestamp()` always returns `TIMESTAMP WITH TIMEZONE` -- i.e., UTC.
   * `my_parquet_file_with_isAdjustedToUTC_true` are always `TIMESTAMP WITH TIMEZONE` -- i.e., UTC. (Parquet doesn't store timezones anyway.)
   * `my_arrow_ipc_file_with_timezone_america_montreal` produces `TIMESTAMP WITH TIMEZONE` -- i.e., UTC. (DataFusion simply drops the metadata, without warning.)
   * `my_parquet_file_with_isAdjustedToUTC_false` produces an error for now, saying DataFusion doesn't support `TIMESTAMP WITHOUT TIME ZONE`.
   * `my_arrow_ipc_file_with_timezone_null` gives a deprecation warning and `TIMESTAMP WITH TIMEZONE`; then, some version in the future, DataFusion re-implements it as `TIMESTAMP WITHOUT TIMEZONE`.
   
   To me, dropping timezone info is an easy (if rude) decision. DataFusion implements a subset of Postgres, and Postgres has no timezone metadata. There's an argument to be made for passing the timezone through unmodified; but that would be hard/convoluted, probably needing new syntax; and does anybody actually want that? IPC with datetime+timezone tuples is nonstandard and heavyweight; Spark doesn't even have a type for that. Forcing UTC won't stop anybody from accomplishing anything.
   
   So the only hairy decision is what to do with Arrow `timezone=null`. Up until today, DataFusion users have been using it as `TIMESTAMP WITH TIMEZONE`, right? If that's the case, then there's great news: nobody who's using DataFusion is using `TIMESTAMP WITHOUT TIMEZONE` :). Sounds like that type would be a new feature ... and I expect it would be a low priority, since nobody asked for it. (Again, Spark has no such type.)
   
   (Primer on Postgres, for the uninitiated: `TIMESTAMP WITH TIME ZONE` means UTC; it doesn't store a timezone.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] westonpace commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-874367756


   > I'm not sure I'd call it a "bug" per se, but all of the to_timestamp*() functions output TimestampTypes in Arrow that have a Timezone field of None. The issues here are:
   
   This does seem like an inconsistency with the just agreed upon convention.  I would expect all of these methods to convert to `Timestamp(unit, "UTC")`.
   
   > I'm aware there was a recent vote to treat the None timestamp like a local timestamp, but this isn't always the case either. For example, we have been using the output of None timestamps from DataFusion but we treat all our timestamps as UTC internally.
   
   Correct, I think DataFusion is interpreting `Timestamp(unit, None)` as "instant".  This is still valid, but no longer "conventional" and could cause interoperability problems.  That being said, "local" timestamps don't often get stored or travel between implementations.  As long as DataFusion treats both `Timestamp(unit, None)` and `Timestamp(unit, UTC)` the same then I wouldn't say it is urgent or must-fix.  It's probably a topic for discussion within DF.
   
   > Timestamp types with different timezones are not comparable / compatible types. For example, if I have data with a timezone of Some("UTC") the following fails due to incompatible types: WHERE timestamp_col > to_timestamp("2021-06-21T12:00Z") (because timestamp_col has Some(UTC) but to_timestamp returns None)
   
   I'm not sure if you are saying the are not comparable or they should not be comparable.  They "should" be comparable if you are comparing two columns with a non-None timestamp (e.g. `Timestamp(units, "UTC")` should be comparable with `Timestamp(units, "+0700")` but neither should compare to `Timestamp(units, None)`.  This is going by the newly discussed "conventional" logic where `None` is "local" and `UTC` is "instant.  Since DF has it backwards right now you'd need to invert the logic.
   
   > The best solution I can think of would be for to_timestamp(...) to support a second, optional argument where the timezone can be specified.
   
   As long as it is optional that is probably a good idea.  However, that sounds like it is adding capability and wouldn't fix anything on its own.
   
   > There is also the possibility to simplify things and eliminate the timezone field from Arrow, but that's a really big change. For example, Parquet does not store the timezone but simply has a isAdjustedToUTC field. If True, that means timestamps are UTC, and if not that means local timestamp.
   
   I'm not sure exactly what you are saying here.  `isAdjustedToUTC` basically boils down to `timezone != null` (I might have that backwards).  If you are arguing to avoid using timestamps other than `UTC` or `null` then I agree that is a viable option.  Implementations can always store timestamp info in a separate column or always represent "zoned date times" as structs (given that the only reason to use "zoned date time" in the first place is typically for field extraction).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] adamhooper commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
adamhooper commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884933560


   > Theoretically I agree that a two step process is clearest; The challenge is that the output type of `to_timestamp` needs to be known in advance (aka planning time) prior to access to the values of the strings.
   
   This sounds exactly like the problems Postgres faced. Postgres' solutions all have predictable types:
   
   * `SELECT CAST('2021-07-22T9:40:00-04:00' AS TIMESTAMPTZ)` => `2021-07-22 13:40` (zone=UTC)
   * `SELECT CAST('2021-07-22T9:40:00-04:00' AS TIMESTAMP)` => `2021-07-22 09:40` (zone=null)
   * `SELECT '2021-07-22T13:40Z'::TIMESTAMPTZ AT TIME ZONE 'Europe/Paris'` => `2021-07-22 15:40` (zone=null)
   * `SELECT '2021-07-22T15:40'::TIMESTAMP AT TIME ZONE 'Europe/Paris'` => `2021-07-22 13:40` (zone=UTC)
   
   If this kind of grammar is too tricky to implement -- or, heck, if DataFusion wants to be more legible -- these solutions could be rewritten in function-style syntax:
   
   * `TO_TIMESTAMP_UTC()` => input is string, output is Timestamp(UTC). Absence of offsets in the input string means they're interpreted as UTC. (Postgres interprets the absence of offsets as session timezone, not UTC; I'm proposing DataFusion hard-code UTC until it gets session timezone ... which may be never ;).)
   * `TO_TIMESTAMP_LOCALTIME()` => input is string, output is Timestamp(null). Offsets in the input string are ignored.
   * `TIMESTAMP_UTC_TO_LOCALTIME(timestamp_utc, zone)` => input is Timestamp(UTC), output is Timestamp(null).
   * `TIMESTAMP_LOCALTIME_TO_UTC(timestamp_localtime, zone)` => input is Timestamp(null), output is Timestamp(UTC).
   
   Postgres's `TO_TIMESTAMP()` is a quasi-alias for the first item in my lists. It outputs a fixed type.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] westonpace commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884452349


   Does this seem correct?
   
   
   
   Example | Output Arrow Type | Value | Output when printed
   -- | -- | -- | --
   to_timestamp('2021-07-20') | Timestamp(Nanoseconds, None) | nanos since unix epic until 2021-07-20 local time | 2021-07-20T0
   0:00:00
   to_timestamp('2021-07-20+00:00') | Timestamp(Nanoseconds, None) | nanos since unix epic until 2021-07-20 UTC | 2021-07-20T00:00:00
   to_timestamp('2021-07-20+01:00') | Timestamp(Nanoseconds, None) | nanos since unix epic until 2021-07-20 UTC+1 | 2021-07-19T23:00:00
   to_timestamp('2021-07-20', 'UTC') | Timestamp(Nanoseconds, Utc) | nanos since unix epic until 2021-07-20 local time | 2021-07-19T00:00:00Z
   to_timestamp('2021-07-20+00:00', 'UTC') | Timestamp(Nanoseconds, Utc) | nanos since unix epic until 2021-07-20 UTC | 2021-07-20T00:00:00Z
   to_timestamp('2021-07-20+01:00', 'UTC') | Timestamp(Nanoseconds, Utc) | nanos since unix epic until 2021-07-20 UTC | 2021-07-20T23:00:00Z
   
   I think the one most interesting to me is row 4 so double check that one (also not the presence/absence of a trailing Z to indicate local time or not).
   
   By "output when printed" I mean what an implementation would get if it tried to force the value into an ISO-8601 string.
   
   Just a few extra thoughts...
   
   Does `to_timestamp` only accept offsets in the strings (e.g. `+01:00` and not tzdata strings or things like `MST`)?  For reference, ISO-8601 only deals with offsets.
   
   Will Rust ever generate an Output Arrow Type with a time zone string (e.g. MST)?
   
   My point with the last two questions is that offset conversions aren't too bad if you aren't handling tzdata strings.  Once you do then you need a timezone database which is a bit of a pain but solvable.  The C++ impl recently added compute kernels which rely on a timezone database and it's a non-trivial effort.  @rok has been spearheading this effort so you might reach out if you go down this road.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] velvia commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
velvia commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-881794904


   To add to the above list, to_timestamp(string) should really return a Timestamp Scalar with the proper timezone derived from parsing the string.  Currently no TZ info is returned, but for example if the date time string has "Z" in it, then the resulting scalar timezone should be "UTC".   
   
   The bigger question really is how much effort to pour into all this.  Since the other Arrow implementations and the Arrow spec already specifies TimeZone, and due to many issues mentioned above, it seems to me that at least we should try to get notion of UTC vs local time zone, and be able to maintain that distinction, correctly.  IE, Rust DataFusion should IMHO have first class support of both `Timestamp(_, None)` as well as `Timestamp(_, UTC)`, everywhere.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] adamhooper commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
adamhooper commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884445745


   @alamb Could you please write example results? Maybe centered around 1970-01-01 so the math is trivial?
   
   `TO_TIMESTAMP('1970-01-01+01:00')` => will this produce `-3_600_000_000_000`? `0`? Or `+3_600_000_000_000`? (I would expect `-3_600_000_000_000` regardless of second param, because why would the `+01:00` be in the input string otherwise....)
   
   I think allowing `'UTC'` as a second parameter seems not-terrible. (That's a compliment ;).) It's backwards-compatible and not too confusing. You could even plod towards setting it by default, across major versions, by adding a `'local'` magic default value; deprecating NULL; switching the default to `'UTC'`; deprecating the param entirely; then finally dropping the param. I think allowing _other_ (non-`'UTC'`) values would be confusing to a doc-defeating extent (and of questionable value, since Parquet doesn't support other timezones).
   
   `TO_TIMESTAMP('1970-01-01')` => you wrote "local time" in two examples, and I think that's ambiguous because it defines the result in terms of the result. Do you mean the client's localtime? The server's localtime? The session localtime? Or "calendar-date-plus-wall-time" -- e.g., "nanos since unix epoch until 1970-01-01 UTC"? So `TO_TIMESTAMP('1970-01-01')` would always give `0` and `TO_TIMESTAMP('1970-01-01', 'UTC')` would always give `0`?
   
   `TO_TIMESTAMP('1970-01-01+01:00', 'UTC')` => `-3_600_000_000_000`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884461987


   @adamhooper  here is what I had in mind in terms of values. Row 4 is also the most interesting, perhaps. I think this is consistent with what @westonpace has in his table
   
   | Example                                           | Output Arrow Type                    |     Value
   | ------------------------------------------ | ------------------------------------ | ------------------------------------------------------
   | to_timestamp('1970-01-01')                 | `Timestamp(Nanoseconds, None)`       | 0
   | to_timestamp('1970-01-01+00:00')           | `Timestamp(Nanoseconds, None)`       | 0
   | to_timestamp('1970-01-01+01:00')           | `Timestamp(Nanoseconds, None)`       | -3600000000000 (1 hour of nanoseconds)
   | to_timestamp('1970-01-01', 'UTC')          | `Timestamp(Nanoseconds, Utc)`        | 5*3600000000000 (5 hour of nanoseconds) __assuming code was run on a computer set to EDT, UTC-5:00__
   | to_timestamp('1970-01-01+00:00', 'UTC')    | `Timestamp(Nanoseconds, Utc)`        | 0
   | to_timestamp('1970-01-01+01:00', 'UTC')    | `Timestamp(Nanoseconds, Utc)`        | -3600000000000 (1 hour of nanoseconds)
   
   
   > Does to_timestamp only accept offsets in the strings (e.g. +01:00 and not tzdata strings or things like MST)? For reference, ISO-8601 only deals with offsets.
   
   It handles a bunch of different formats (see [source](https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/datetime_expressions.rs#L57-L63)) but mostly trying to 
   ```
   /// Examples of accepted inputs:
   /// * `1997-01-31T09:26:56.123Z`        # RCF3339
   /// * `1997-01-31T09:26:56.123-05:00`   # RCF3339
   /// * `1997-01-31 09:26:56.123-05:00`   # close to RCF3339 but with a space rather than T
   /// * `1997-01-31T09:26:56.123`         # close to RCF3339 but no timezone offset specified
   /// * `1997-01-31 09:26:56.123`         # close to RCF3339 but uses a space and no timezone offset
   /// * `1997-01-31 09:26:56`             # close to RCF3339, no fractional seconds
   ```
   
   > Will Rust ever generate an Output Arrow Type with a time zone string (e.g. MST)?
   
   The Rust arrow implementation already can encode the timezone string but the support in compute kernels and datafusion for a timezone other than `None` is basically non existent
   
   However, in my mind the point of adding a second argument to `to_timestamp` is to allow the construction of output arrow types with timezone strings. The tz database has a rust wrapper which we may have to pull in https://github.com/chronotope/chrono-tz


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] velvia commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
velvia commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-885231416


   Hi folks,
   
   Excellent discussion; I need to go back and reread the detailed scenarios that Andrew and others laid out.
   
   I did propose a second argument to the to_timestamp() functions, though I’m slightly leaning towards having separate functions, a separate one specifically for timezone conversion or designation - mainly for clarity.
   
   To help us think through it though, let’s take the default case where `to_timestamp()` does NOT have a second argument, which most people (definitely those coming from other SQL dialects) would omit.
   
   Arrow/DataFusion must know the output type currently, so we have to pick one.  Say it is Timestamp(_, UTC).  Would this be correct?  (UTC seems to be the only 
   
   The cases are:
   - string, with no clear timezone designation.  In this case, local (server) timezone might be picked to interpret, and then the results could be converted to be UTC-based timestamp.  There is ambiguity though, but one can argue if the original string was UTC it should have “Z” or appropriate offset anyways.
   - string, with timezone designation (+/- offset, or Z).  In this case we get the timezone from the string, we could convert output to be UTC based.
   - int64/uint64.  In this case, this is a numeric timestamp from epoch, but there is again ambiguity.  Do we treat it as local timezone based or UTC based?
   - Timestamp column.  If the source is already a timestamp column, the user is looking to probably convert units (say mills to nanos) and preserve the timezone.  This is a case where having a fixed output type does not really work well, but for this case we can actually specify that DataFusion produce an array with the same timezone output as the input, so it’s fine.
   
   I apologize in advance if I’m rehashing discussions from earlier in the thread still need to catch up.
   
   > On Jul 22, 2021, at 12:20 PM, Andrew Lamb ***@***.***> wrote:
   > 
   > 
   > TO_TIMESTAMP_UTC
   > TO_TIMESTAMP_LOCALTIME
   > Yes, this is what I thought @velvia <https://github.com/velvia> was more or less proposing, but instead of those names, use a second optional argument
   > 
   > to_timestamp(.., 'UTC') --> TO_TIMESTAMP_UTC(..)
   > to_timestamp(..) --> TO_TIMESTAMP_LOCALTIME(..)
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub <https://github.com/apache/arrow-datafusion/issues/686#issuecomment-885171173>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAIDPW4W353SLEIUIWPPZMLTZBVOTANCNFSM473NSWHA>.
   > 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] velvia commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
velvia commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-876625232


   @adamhopper personally I'd be OK with your recommendations, but some thoughts:
   - There is some interop considerations between `TIMESTAMP WITH TIMEZONE` ie TimestampType(_, UTC) vs TimestampType(_, None).   The reason is that so far DataFusion has been generating `TimestampType(_, None)`.  If this changes to UTC, which makes sense as far as conventions go, then we need some way for people with data with Timestamp(_, None) to be able to convert it over to (_, UTC).
   - I'm not sure it would be acceptable to throw an error at `isAdjustedToUTC = false` as I'm sure people have Parquet files out there with this, and then DataFusion/Arrow wouldn't be able to read it.   But maybe it's OK if you give people a way to cast / convert the timestamps over.
   
   Also, I see your point about `to_timestamp()` having a timezone argument being confusing.    However there is a second use of `to_timestamp()` right now, which is for conversion purposes, from other types to Timestamp(resolution, tz).  The reason it exists like this is because there is no standard SQL/PG way to work with multiple timestamp resolutions, which are required and supported by all Arrow implementations.    
   
   Thus, while `TO_TIMESTAMP('2021-01-01T00:00:00', 'America/Montreal')` might be confusing, I believe the following would be more clear:
   
   * `to_timestamp_millis(my_timestamp_col, 'UTC')` -> cast my_timestamp_col from int or another timestamp resolution to millisecond resolution, and make sure it has UTC timezone
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-881884131


   > Do the above sound about right?
   
   yes that sounds right
   
   I am very supportive of an effort to have first class (and correct) timestamp support in Rust both for Arrow and DataFusion. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] westonpace edited a comment on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
westonpace edited a comment on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884452349


   Does this seem correct?
   
   
   
   Example | Output Arrow Type | Value | Output when printed
   -- | -- | -- | --
   to_timestamp('2021-07-20') | Timestamp(Nanoseconds, None) | nanos since unix epic until 2021-07-20 local time | 2021-07-20T00:00:00
   to_timestamp('2021-07-20+00:00') | Timestamp(Nanoseconds, None) | nanos since unix epic until 2021-07-20 UTC | 2021-07-20T00:00:00
   to_timestamp('2021-07-20+01:00') | Timestamp(Nanoseconds, None) | nanos since unix epic until 2021-07-20 UTC+1 | 2021-07-19T23:00:00
   to_timestamp('2021-07-20', 'UTC') | Timestamp(Nanoseconds, Utc) | nanos since unix epic until 2021-07-20 local time | 2021-07-19T00:00:00Z
   to_timestamp('2021-07-20+00:00', 'UTC') | Timestamp(Nanoseconds, Utc) | nanos since unix epic until 2021-07-20 UTC | 2021-07-20T00:00:00Z
   to_timestamp('2021-07-20+01:00', 'UTC') | Timestamp(Nanoseconds, Utc) | nanos since unix epic until 2021-07-20 UTC | 2021-07-20T23:00:00Z
   
   I think the one most interesting to me is row 4 so double check that one (also note the presence/absence of a trailing Z to indicate local time or not).
   
   By "output when printed" I mean what an implementation would get if it tried to force the value into an ISO-8601 string.
   
   Just a few extra thoughts...
   
   Does `to_timestamp` only accept offsets in the strings (e.g. `+01:00` and not tzdata strings or things like `MST`)?  For reference, ISO-8601 only deals with offsets.
   
   Will Rust ever generate an Output Arrow Type with a time zone string (e.g. MST)?
   
   My point with the last two questions is that offset conversions aren't too bad if you aren't handling tzdata strings.  Once you do then you need a timezone database which is a bit of a pain but solvable.  The C++ impl recently added compute kernels which rely on a timezone database and it's a non-trivial effort.  @rok has been spearheading this effort so you might reach out if you go down this road.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] adamhooper commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
adamhooper commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884461206


   @westonpace Could you please explain what happens on row 4?
   
   The rest of it seems perfect to me. Thank you for answering :).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] westonpace commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-874367756






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884896175


   > It feels to me like to_timestamp is doing two steps. First, it is converting to a string to a Timestamp (respecting the offset in the string, if any). Second, it is "reinterpret" casting (schema only change) that timestamp to the desired local/instant semantics. 
   
   Yes I agree that is what it is doing. 
   
   > I'd personally prefer something that breaks things into two steps if needed.
   
   Theoretically I agree that a two step process is clearest; The challenge is that the output type of `to_timestamp` needs to be known in advance (aka planning time) prior to access to the values of the strings. 
   
   Imagine a query that is like `to_timestamp(string_column)` that needs to produce an array that has uniform type. What type is chosen? Assuming the second argument must be a string constant (aka would not allow `to_timestamp(timestamp_column, timezone_column)` the output type can be known with the proposal in https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884461987)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] rok commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
rok commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884930875


   > Theoretically I agree that a two step process is clearest; The challenge is that the output type of `to_timestamp` needs to be known in advance (aka planning time) prior to access to the values of the strings.
   
   Isn't output of `to_timestamp` just `timestamp(unit, timezone_string)`? (I'm not familiar with Rust implementation).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] velvia commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
velvia commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-875109908


   It sounds like everyone agrees second, optional argument to `to_timestamp*()` to force a specific timezone result works.
   
   As for casting/coercion, we have:
   
   @alamb 
   > I think a better solution would be for DataFusion to implement coercion rules between date/time/timestamp types. In your example of a timestamp_col with Timestamp(Milliseconds, Some("UTC")), then I would like DataFusion to handle
   
       WHERE timestamp_col > to_timestamp("2021-06-21T12:00Z")
   by adding the following cast (please forgive me the pseudo-SQL)
   
       WHERE timestamp_col > cast (to_timestamp("2021-06-21T12:00Z") as timestamp(UTC))
   
   The above is saying that `Timestamp(_, None)` can be coerced into `Timestamp(_, Some("UTC"))`, is that right?  (AFAICT, we can't just say that only functions called in WHERE clauses that output None be coerced?)   However, a None implies local timestamp, which is not the same as an instant with UTC (although semantically for UrbanLogiq that definitely works. :)
   
   OTOH @westonpace wrote:
   
   > I'm not sure if you are saying the are not comparable or they should not be comparable. They "should" be comparable if you are comparing two columns with a non-None timestamp (e.g. Timestamp(units, "UTC") should be comparable with Timestamp(units, "+0700") but neither should compare to Timestamp(units, None). This is going by the newly discussed "conventional" logic where None is "local" and UTC is "instant. Since DF has it backwards right now you'd need to invert the logic.
   
   It seems to me that coercing Timestamp(_, None) to `Timestamp(_, "UTC")` would not fit the recent convention, at least automatic coercion without context.
   
   So follow up / action items / possibilities:
   
   1.  Should the current output of `to_timestamp(...)` default to `Timestamp(_, Some(UTC))` ?   This sounds like a yes (and would solve my use case)
   2.  `to_timestamp(....)` can be extended to add a second argument to accept different time zones, or None (not sure how to specify None if the default is UTC, per item 1)
   3.  Coercion ... the details of this needs to be figured out, semantics
   4.  Or, simplify everything and have the timestamp types not have timezone, but just if they are UTC or local (similar to Parquet).
   
   BTW I looked up the C/C++ API, and the Python API, and none of those appear to support timezones in the timestamp type, just a unit.   It could be argued that 4) would bring the Rust API closer to the other languages APIs, and when arrow datatypes are translated from Flight or the binary representation, how is the timezone encoded, if at all?   
   
   (if 4 is a worthy avenue to discuss, I can start a email chain)
   
   cheers and thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] westonpace commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884535402


   Aside: this case `to_timestamp('2021-07-20', 'UTC')` could be ambiguous if the server time zone happens to be in a DST shift.
   
   > I think this is consistent with what @westonpace has in his table
   
   Then I think what you have proposed matches (as best I can tell) the recently agreed upon spec so that is all good.  Usability-wise, I don't know, but I'll admit I don't have much stake in a usability decision (I'm more worried about interoperability).  So take this next paragraph with a grain of salt :laughing: 
   
   It feels to me like `to_timestamp` is doing two steps.  First, it is converting to a string to a Timestamp (respecting the offset in the string, if any).  Second, it is "reinterpret" casting (schema only change) that timestamp to the desired local/instant semantics.  I'd personally prefer something that breaks things into two steps if needed.
   
   Example | One Step Type | Two Step Equivalent | Two Step Type
   -- | -- | -- | --
   to_timestamp('1970-01-01') | Timestamp(Nanoseconds, None) | reinterpret_zone(to_timestamp('1970-01-01'), None) | Timestamp(Nanoseconds, None)
   to_timestamp('1970-01-01+00:00') | Timestamp(Nanoseconds, Utc) | reinterpret_zone(to_timestamp('1970-01-01+00:00')) | Timestamp(Nanoseconds, None)
   to_timestamp('1970-01-01+01:00') | Timestamp(Nanoseconds, +01:00) | reinterpret_zone(to_timestamp('1970-01-01+01:00')) | Timestamp(Nanoseconds, None)
   to_timestamp('1970-01-01', 'UTC') | Timestamp(Nanoseconds, None) | reinterpret_zone(to_timestamp('1970-01-01'), 'UTC') | Timestamp(Nanoseconds, Utc)
   to_timestamp('1970-01-01+00:00', 'UTC') | Timestamp(Nanoseconds, Utc) | reinterpret_zone(to_timestamp('1970-01-01+00:00', 'UTC'), 'Utc') | Timestamp(Nanoseconds, Utc)
   to_timestamp('1970-01-01+01:00', 'UTC') | Timestamp(Nanoseconds, +01:00) | reinterpret_zone(to_timestamp('1970-01-01+01:00'), 'UTC') | Timestamp(Nanoseconds, Utc)
   
   This would change the meaning of the fourth row so it would have the value 0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] westonpace edited a comment on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
westonpace edited a comment on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884535402


   Aside: this case `to_timestamp('2021-07-20', 'UTC')` could be ambiguous if the server time zone happens to be in a DST shift.
   
   > I think this is consistent with what @westonpace has in his table
   
   Then I think what you have proposed matches (as best I can tell) the recently agreed upon spec so that is all good.  Usability-wise, I don't know, but I'll admit I don't have much stake in a usability decision (I'm more worried about interoperability).  So take this next paragraph with a grain of salt :laughing: 
   
   It feels to me like `to_timestamp` is doing two steps.  First, it is converting to a string to a Timestamp (respecting the offset in the string, if any).  Second, it is "reinterpret" casting (schema only change) that timestamp to the desired local/instant semantics.  I'd personally prefer something that breaks things into two steps if needed.
   
   Example | One Step Type | Two Step Equivalent | Two Step Type
   -- | -- | -- | --
   to_timestamp('1970-01-01') | Timestamp(Nanoseconds, None) | reinterpret_zone(to_timestamp('1970-01-01'), None) | Timestamp(Nanoseconds, None)
   to_timestamp('1970-01-01+00:00') | Timestamp(Nanoseconds, Utc) | reinterpret_zone(to_timestamp('1970-01-01+00:00'), None) | Timestamp(Nanoseconds, None)
   to_timestamp('1970-01-01+01:00') | Timestamp(Nanoseconds, +01:00) | reinterpret_zone(to_timestamp('1970-01-01+01:00'), None) | Timestamp(Nanoseconds, None)
   to_timestamp('1970-01-01', 'UTC') | Timestamp(Nanoseconds, None) | reinterpret_zone(to_timestamp('1970-01-01'), 'UTC') | Timestamp(Nanoseconds, Utc)
   to_timestamp('1970-01-01+00:00', 'UTC') | Timestamp(Nanoseconds, Utc) | reinterpret_zone(to_timestamp('1970-01-01+00:00', 'UTC'), 'Utc') | Timestamp(Nanoseconds, Utc)
   to_timestamp('1970-01-01+01:00', 'UTC') | Timestamp(Nanoseconds, +01:00) | reinterpret_zone(to_timestamp('1970-01-01+01:00'), 'UTC') | Timestamp(Nanoseconds, Utc)
   
   This would change the meaning of the fourth row so it would have the value 0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] westonpace commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-874368411


   > To be specific, if you do to_timestamp(date_string) then it can figure out the timezone from the string, sometimes. However, current architecture of Arrow/DF requires a specific timezone output type.
   If you did to_timestamp(int64) then you would not have any timezone info at all, and would require the input in the second argument no matter what for completeness.
   
   I don't think `to_timestamp(date_string)` should always put the time zone into the data type.  It should either never do so and have a second, optional time zone argument (for consistency with the other `to_timestamp` methods) or it should expose a boolean parameter (`keep_timezone` or something like that).  The time zone is needed in order to figure out the correct instant.  However, it is often not important beyond that.  What happens if you have strings from different time zones?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] velvia commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
velvia commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-874356725


   @alamb @westonpace .... your thoughts on this matter would be really welcome. :)   
   
   To be specific, if you do `to_timestamp(date_string)` then it can figure out the timezone from the string, sometimes.  However, current architecture of Arrow/DF requires a specific timezone output type.
   If you did `to_timestamp(int64)` then you would not have any timezone info at all, and would require the input in the second argument no matter what for completeness.
   
   There is also the possibility to simplify things and eliminate the timezone field from Arrow, but that's a really big change.  For example, Parquet does not store the timezone but simply has a isAdjustedToUTC field.  If True, that means timestamps are UTC, and if not that means local timestamp.  
   https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] westonpace commented on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884538717


   Indeed, the more I look at this the more I wonder why the first row doesn't use the server time but the fourth row does use the server time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] adamhooper edited a comment on issue #686: Specific timezone support for `to_timetamp*()`

Posted by GitBox <gi...@apache.org>.
adamhooper edited a comment on issue #686:
URL: https://github.com/apache/arrow-datafusion/issues/686#issuecomment-884485767


   > to_timestamp('1970-01-01', 'UTC')	`Timestamp(Nanoseconds, Utc)`	5*3600000000000 (5 hour of nanoseconds) **assuming code was run on a computer set to EDT, UTC-5:00**
   
   Oh, I see -- this is existing behavior. But then, does existing behavior define a different result for row 1? https://github.com/apache/arrow-datafusion/blob/5900b4c6829b0bdbe69e1f95fb74e935bc8f33d4/datafusion/src/physical_plan/datetime_expressions.rs#L85
   
   In my view, if the existing behavior has to stay, it has to stay; but if it _doesn't_ have to stay, then as a user I'd feel much more comfortable if DataFusion didn't expose the server's timezone -- or at least made it a session variable, as Postgres does.
   
   Note that we're up to _four_ timezones in every call to `TO_TIMESTAMP()`:
   
   * `NULL`: "localtime" the Arrow concept -- calendar-date-plus-wall-time
   * `UTC`
   * The input-string timezone offset
   * DataFusion server's "local timezone" (distinct from Arrow's "local time", and presumably unknown to my users)
   
   My use case is to run custom SQL from the general public. I realize I can use environment variables to force DataFusion's "server timezone" to be UTC; but that doesn't negate the learning curve.
   
   Postgres has the same complexity. Its syntax:
   
   * `SELECT timestamp_with_time_zone AT TIME ZONE 'America/Eastern'` => converts UTC to localtime
   * `SELECT timestamp_with_time_zone::TIMESTAMP` => converts UTC to session-timezone localtime
   * `SELECT timestamp_without_time_zone AT TIME ZONE 'America/Eastern' => converts localtime to UTC
   * `SELECT timestamp_without_time_zone::TIMESTAMPTZ` => converts localtime to UTC using session-timezone localtime
   
   ... and I'll bet most Postgres misunderstand these concepts for years until they figure it out.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org