You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/22 12:54:20 UTC

[GitHub] [arrow-rs] alamb opened a new issue #597: Can not create a Timestamp array that has a specified timezone

alamb opened a new issue #597:
URL: https://github.com/apache/arrow-rs/issues/597


   **Describe the bug**
   I can not create a `PrimitiveArray` that has a `DataType::Timestamp` with a timezone other than `None` 
   
   **To Reproduce**
   I am trying to pretty print an array that has timestamps using UTC time.
   
   So instead of `1970-01-01 00:00:00.000000100` I want to print something more like the following (note the Z, in RFC3339): `1970-01-01T00:00:00.000000100Z`
   
   To do so I figured I would "simply" create a new array that had a Timestamp type with the timezone set to UTC
   
   `TimestampNanosecondArray` is defined like this:
   ```rust
   pub type TimestampNanosecondArray = PrimitiveArray<TimestampNanosecondType>;
   ```
   
   And `TimestampNanosecondType` is (effectively) something like
   ```rust
   struct TimestampNanosecondType {}
   impl ArrowPrimitiveType for TimestampNanosecondType {
     type Native = i64;
     const DATA_TYPE = DataType::Timestamp(TimeUnit::Nanosecond, None)
   }
   ```
   
   So I figured I would make a `PrimitiveArray<SOME_TYPE_THAT_HAD_MY_TIMEZONE_SPECIFIED>`
   
   ```rust
   struct TimestampUtcNanosecondType {}
   impl ArrowPrimitiveType for TimestampUtcNanosecondType {
       type Native = i64;
       const DATA_TYPE: DataType = DataType::Timestamp(TimeUnit::Nanosecond, Some("+00:00".to_string()));
   }
   ```
   
   Uhoh! The compiler doesn't like that because it needs a `const String` which you can't have....
   ```
   error[E0015]: calls in constants are limited to constant functions, tuple structs and tuple variants
     --> src/main.rs:38:80
      |
   38 |     const DATA_TYPE: DataType = DataType::Timestamp(TimeUnit::Nanosecond, Some("+00:00".to_string()));
      |                                                                                ^^^^^^^^^^^^^^^^^^^^
   ```
   
   So I conclude it is basically impossible to create an array with a timezone other than `None`
   
   **Expected behavior**
   I expect to be able to create an array using a timezone
   
   **Proposal**
   
   Proposal:
   
   Change `DataType::Timestamp` from
   ```rust
   Timestamp(TimeUnit, Option<String>),
   ```
   to
   ```rust
   Timestamp(TimeUnit, Option<&static str>),
   ```
   
   
   **Additional context**
   FWIW I also tried using `lazy_static` but it doesn't provide a `const String` (only a `static String`)
   
   I view this as a potential first step towards handling timestamps properly in arrow and datafusion: https://github.com/apache/arrow-datafusion/issues/686


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] velvia commented on issue #597: Can not create a Timestamp array that has a specified timezone

Posted by GitBox <gi...@apache.org>.
velvia commented on issue #597:
URL: https://github.com/apache/arrow-rs/issues/597#issuecomment-885243364


   @alamb PRs/issues which are related to this and other timezone issues and I plan to submit soon-ish:
   - Adding proper timezone support for `ScalarValue`, without which many things in SQL with timezones just don't work
   - Having arrays with timezones be properly propagated through various kernels.  Simple fixes but really important
   - Having Debug actually print the actual timezone, not the None implied by the type, for timestamp arrays
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] rok commented on issue #597: Can not create a Timestamp array that has a specified timezone

Posted by GitBox <gi...@apache.org>.
rok commented on issue #597:
URL: https://github.com/apache/arrow-rs/issues/597#issuecomment-885596294


   @velvia
   > * I actually think if we were to change the underlying timezone data type, it should be a numeric offset, not a string.  Strings take up a large amount of space, at least 24 bytes on x86 + the actual storage of the string, and are slow;  furthermore every time we need to actually translate it to the offset, which is more likely what we care about for computations.  I'd prefer some kind of `TimeOffset` which could simply be `Option<i32>` or something like that - that's only 5 bytes. :)
   
   Offsets are much more practical indeed but don't cover things like daylight savings changes and human readable timezones (e.g. EST, PST, CET). But I don't know what your scope is and offsets might be enough.
   
   [Chrono-tz](https://github.com/chronotope/chrono-tz/) seems to implement a timezone database in case you decide to look into that direction.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] jorgecarleitao commented on issue #597: Can not create a Timestamp array that has a specified timezone

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on issue #597:
URL: https://github.com/apache/arrow-rs/issues/597#issuecomment-886022683


   I do not see what is the concern with `String`. It represent exactly what the spec expects; an owned utf8.
   
   The root cause here seems to be the trait of the generic of the `PrimitiveArray`, that expects a type with a compiled-defined data type, which only works for small, finite-sized datatypes. Even the decimal array could have been represented as a `PrimitiveArray`.
   
   Arrow2 solves this by decoupling the generic on the PrimitiveArray from the (logical) `DataType`, exactly to natively support timestamps with timezones.
   
   I would recommend bridging this gap by refactoring the code to not have the `DATA_TYPE` as the constant of the trait.
   
   AFAI understand `&'static` forbids people from declaring an offset based on some transformation. E.g. a field composed by something like `"some_garbage_{offset}"` can no longer be parsed into an offset without matching for all cases, since the "_offset" will not be `'static`.
   
   Having a generic for every possible timezone significantly bloats the binary, as it requires all generics to be re-compiled for every declared timezone variation, and there are a lot of them. The typical problem will be in `match`ing the `DataType` and having to have one branch per Timezone, which requires a significant number of cases.
   
   Generics are used to support different physical in-memory representations of a struct, not to declare semantics. Semantics based on generics is a recipe for large binaries and/or a large number of matches in downcasting trait objects.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] rok commented on issue #597: Can not create a Timestamp array that has a specified timezone

Posted by GitBox <gi...@apache.org>.
rok commented on issue #597:
URL: https://github.com/apache/arrow-rs/issues/597#issuecomment-884938054


   [C++ uses a string](https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.h#L1257) but I'm not familiar with rust implementation at all.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on issue #597: Can not create a Timestamp array that has a specified timezone

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #597:
URL: https://github.com/apache/arrow-rs/issues/597#issuecomment-885821148


   @velvia  I suggest we ensure the Rust implementation can interoperate with the timezone / time definition (e.g. use a string as defined here: https://github.com/apache/arrow-rs/blob/master/format/Schema.fbs#L226-L246) That string representation also allows for numeric offsets like `"+01:00"` or named timezones). 
   
   If `str` manipulation is too slow, perhaps we can use an enum like
   ```rust
   enum Timezone {
     /// UTC offset, in minutes (?tbd units)
     Offset(i32)
     /// character with the meaning from arrow definition
     name(&'static str)
   }
   ```
   
   > @alamb there is actually a way to create primitive timezone arrays in other timezones.
   
   That is quite cool -- I did not realize that ๐Ÿ‘  It is still somewhat strange because the `TimestampNanosecondArray` is effectively "typed" to a `DataType::Timestamp(TimeUnit::Nanoseconds, None)` but the actual instance may have a non-None timestamp, which seems confusing
   
   > Adding a ref/pointer will require adding lifetimes to type annotations, which would be really annoying and a huge global change
   
   That is why I proposed using `'static` (aka so that we didn't have to plumb through lifetimes everywhere) - but it might be annoying to arrange for static strings 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on issue #597: Can not create a Timestamp array that has a specified timezone

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #597:
URL: https://github.com/apache/arrow-rs/issues/597#issuecomment-886036048


   Good points @jorgecarleitao  and @velvia  -- it sounds like my challenge / problem with `TimestampNanosecondArray` will be solved when we bring in the ideas of arrow2. If it becomes a problem I can look into removing `DATA_TYPE`  as Jorge suggests.
   
   One thing I have noticed that might warrant more thought about something other than `String` for storing timezones is that the `DataType` struct is copied around a lot in arrow code. Maybe something more like `Arc<str>` would be appropriate if we ever want to change the type.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] velvia commented on issue #597: Can not create a Timestamp array that has a specified timezone

Posted by GitBox <gi...@apache.org>.
velvia commented on issue #597:
URL: https://github.com/apache/arrow-rs/issues/597#issuecomment-885241595


   @alamb there is actually a way to create primitive timezone arrays in other timezones.  I have some code which I'll be PR'ing against both Arrow and DataFusion which does this, but includes other things we'll need.  However, something like this:
   
   ```
   let data = vec![Some(.....), Some(....)];
   let array_utc = TimestampNanosecondArray::from_opt_vec(data, Some("UTC".to_owned()));
   ```
   
   You can use from_vec as well and it works too.   I found other ways of creating timezone-based arrays as well.
   
   You don't need to create a type which has non-None timezone (yes I'm aware of that limitation and was going to point it out too).  The key is that the underlying ArrayData has the correct, actual timezone, and this is what is returned by the data_type() calls and checked dynamically.  There is an inconsistency there though, which I agree with, and solving that needs more discussion.
   
   Regarding switching to `&str`, some thoughts:
   - Adding a ref/pointer will require adding lifetimes to type annotations, which would be really annoying and a huge global change
   - I actually think if we were to change the underlying timezone data type, it should be a numeric offset, not a string.  Strings take up a large amount of space, at least 24 bytes on x86 + the actual storage of the string, and are slow;  furthermore every time we need to actually translate it to the offset, which is more likely what we care about for computations.  I'd prefer some kind of `TimeOffset` which could simply be `Option<i32>` or something like that - that's only 5 bytes. :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] velvia commented on issue #597: Can not create a TimestampNanosecondArray that has a specified timezone

Posted by GitBox <gi...@apache.org>.
velvia commented on issue #597:
URL: https://github.com/apache/arrow-rs/issues/597#issuecomment-886108157


   I would definitely agree that we donโ€™t want a specific type per timezone and that the direction of arrow2 is the right one.
   
   @jorgecarleitao the concern with using Strings is twofold:
   
   1) Parsing cost.  It means any timezone manipulation first requires parsing the string, and that has to be done for every single Array since each one owns its own String and could be different.  
   2) Storage and serialization costs.
   
   Using Arc<> and friends would help at least number 2, but not number 1.
   
   For fast data processing, a solution like the enum that Andrew proposed, or an enum with different timezones, for internal storage would solve both problems.  One can remain compatible with the spec by still taking in a string in APIs, but allow storage and processing to be optimized.
   
   
   > On Jul 24, 2021, at 3:56 AM, Andrew Lamb ***@***.***> wrote:
   > 
   > 
   > Good points @jorgecarleitao <https://github.com/jorgecarleitao> and @velvia <https://github.com/velvia> -- it sounds like my challenge / problem with TimestampNanosecondArray will be solved when we bring in the ideas of arrow2. If it becomes a problem I can look into removing DATA_TYPE as Jorge suggests.
   > 
   > One thing I have noticed that might warrant more thought about something other than String for storing timezones is that the DataType struct is copied around a lot in arrow code. Maybe something more like Arc<str> would be appropriate if we ever want to change the type.
   > 
   > โ€”
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub <https://github.com/apache/arrow-rs/issues/597#issuecomment-886036048>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAIDPWYIKFKWZKSV35S6HZLTZKL6LANCNFSM5AZ6KW5Q>.
   > 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] velvia commented on issue #597: Can not create a Timestamp array that has a specified timezone

Posted by GitBox <gi...@apache.org>.
velvia commented on issue #597:
URL: https://github.com/apache/arrow-rs/issues/597#issuecomment-886016326


   @alamb I agree that enum would cover all existing use cases.  Still I'm curious, isn't there any restriction on what that string can be?  In reality in order to perform any timezone adjustments, that string should really conform to one of the standard timezone abbreviations or fully qualified timezone names, or offset - otherwise it cannot be useful.   
   
   Ideally it would be like an enum for each of the possible timezones in the IANA database, for which `chronos-tz` has Timezone implementations.   
   
   If we leave it as a string, there may be timezones which cannot be acted on....  I guess this is likely to be the case.  Having the offset though (or a timezone enum) offers the possibility of faster translation.
   
   Agreed `&static` would work without lifetimes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] velvia commented on issue #597: Can not create a Timestamp array that has a specified timezone

Posted by GitBox <gi...@apache.org>.
velvia commented on issue #597:
URL: https://github.com/apache/arrow-rs/issues/597#issuecomment-885773327


   @rok 
   
   > Offsets are much more practical indeed but don't cover things like daylight savings changes and human readable timezones (e.g. EST, PST, CET). But I don't know what your scope is and offsets might be enough.
   
   The scope is an internal, compact representation to 1) facilitate easy and fast computation, 2) minimize memory and storage costs, 3) minimize network serialization overhead in Ballista.
   
   It should be easily convertible to string representation.  It is true that numeric offsets would lose some information, perhaps there are better representations out there, like an enum from chronos or something.
   
   At the same time, this is also a deep rabbit hole and trying to be completely accurate might also not be practical....


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on issue #597: Can not create a Timestamp array that has a specified timezone

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #597:
URL: https://github.com/apache/arrow-rs/issues/597#issuecomment-884889434


   @jorgecarleitao  / @Dandandan  / @GregBowyer  / @velvia  can you think of something I have missed here? Is it a reasonable proposal to switch to using `&static string` for the timezone specifier?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org