You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/04/01 17:56:55 UTC

[GitHub] [arrow-rs] tustvold opened a new issue, #3999: Improved Temporal and Decimal Arithmetic

tustvold opened a new issue, #3999:
URL: https://github.com/apache/arrow-rs/issues/3999

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   <!--
   A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] 
   (This section helps Arrow developers understand the context and *why* for this feature, in addition to  the *what*)
   -->
   
   The current support for arithmetic on temporal and decimal types is limited and fairly inconsistent:
   
   * Binary kernels don't preserve input types, causing timezones, decimal scale and precision, etc... to be lost - #3307
   * Timestamp subtraction should return duration -  #3964 
   * Scalar kernels don't support arithmetic involving different types - #2842
   * Timestamps don't support arithmetic involving durations or intervals
   * Decimal arithmetic doesn't compute the correct output scale and precision
   * No support for temporal or decimal arithmetic in statically typed kernels
   * Operations such as adding two timestamps are supported despite having unclear semantics
   * Duplicated code for scalar/non-scalar, checked/unchecked, dyn/static variants of every operation
   * No support for arithmetic between scalars
   
   **Describe the solution you'd like**
   <!--
   A clear and concise description of what you want to happen.
   -->
   
   It is not a fully formed design, but I would like to implement something inspired by the `std::ops` traits used for Rust arithmetic.
   
   In particular something like
   
   ```
   trait Add<Rhs> {
       type Output = T;
       fn add(&self, rhs: &Rhs) -> Result<Self::Output, ArrowError>;
   
       fn add_checked(&self, rhs: &Rhs) -> Result<Self::Output, ArrowError>;
   }
   
   trait Scalar {
       fn data_type(&self) -> &DataType;
   
       fn as_any(&self) -> &dyn Any;
   }
   
   struct PrimitiveScalar<T: ArrowPrimitiveType>{
       value: T::Native,
       data_type: DataType,
   }
   
   impl <T: ArrowPrimitiveType> Scalar for PrimitiveScalar<T> {
   
   }
   
   impl <T: ArrowPrimitiveType> Add<PrimitiveArray<T>> for PrimitiveArray<T> {
       ...
   }
   
   impl <T: ArrowPrimitiveType> Add<PrimitiveScalar<T>> for PrimitiveArray<T> {
       ...
   }
   
   impl <T: ArrowPrimitiveType> Add<PrimitiveScalar<T>> for PrimitiveScalar<T> {
       ...
   }
   
   impl Add<dyn Array> for dyn Array {
       ...
   }
   
   impl Add<dyn Scalar> for dyn Array {
       ...
   }
   
   impl Add<dyn Scalar> for dyn Scalar {
       ...
   }
   ```
   
   Aside from solving the above problems, it would make for a more idiomatic design that is easier to extend and maintain. We would then deprecate and remove the old kernels.
   
   This would also be a pattern that could be translated to support other kernels that run into similar issues with scalars, e.g. arrow-ord and arrow-string.
   
   **Describe alternatives you've considered**
   <!--
   A clear and concise description of any alternative solutions or features you've considered.
   -->
   
   We could drop support for arithmetic with statically typed arguments, and add further kernels for performing arithmetic on `dyn Scalar` 
   
   **Additional context**
   <!--
   Add any other context or screenshots about the feature request here.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #3999: Improved Temporal and Decimal Arithmetic

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #3999:
URL: https://github.com/apache/arrow-rs/issues/3999#issuecomment-1493437389

   Using the dyn variants will yield that behaviour


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on issue #3999: Improved Temporal and Decimal Arithmetic

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #3999:
URL: https://github.com/apache/arrow-rs/issues/3999#issuecomment-1493423452

   The other thing I wanted to mention was that we already have some non trivial changes in flight (e.g ArrayData, etc) and one thing I think we should try to is wrap up outstanding refactorings before we take on another substantial refactor


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #3999: Improved Temporal and Decimal Arithmetic

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #3999:
URL: https://github.com/apache/arrow-rs/issues/3999#issuecomment-1493416606

   > Having a kernel like add(array: &dyn Array, scalar: &Scalar) is certainly compelling
   
   The eventual idea would be to not have such a kernel at all, instead you would just call `array.add(scalar)` and let trait-resolution work out what variant to call based on the types you give it.
   
   > One thing I think would be a step back was if the traits were complicated enough to express things like can't add PrimitiveType to TimestampType -- because I think that would generate very hard to understand compiler error messages.
   
   I think the errors are fairly clear? - https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=e5e12eacb2b9041f988afe2012637a74
   
   > why use a trait for Scalar rather than an enum
   
   Mainly for consistency with the use of `dyn Array`. We need a strongly typed `PrimitiveScalar` or similar, but whether you wrap that up in a trait object or an enum for the dyn dispatch is largely immaterial, you will end up a large match block either way.
   
   > Do we plan to remove all typed kernels
   
   The intention is for the traits to be implemented for the concrete types, filling the role of the "typed kernels", whilst also being implemented for the trait objects filling the role of the "dyn kernels". The trait-dispatch machinery in Rust will automatically work out which one is appropriate to call.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #3999: Improved Temporal and Decimal Arithmetic

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #3999:
URL: https://github.com/apache/arrow-rs/issues/3999#issuecomment-1493426760

   > would be so many impls that the ones I cared about (for adding timestamps) would not make it into the actual message
   
   It would appear the cut off is 8, which I don't think we will exceed for any type
   
   https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=dd81152086cab213a8ff1d3286aaa28c
   
   Even then I think this is still better than the current approach of using method names to define overloads, at least the compiler tries to help you, instead of always forcing you to search the docs for the specific method signature you want?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #3999: Improved Temporal and Decimal Arithmetic

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #3999:
URL: https://github.com/apache/arrow-rs/issues/3999#issuecomment-1493431272

   >  but that is a personal preference
   
   Perhaps you might articulate what your personal preference would be for?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on issue #3999: Improved Temporal and Decimal Arithmetic

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #3999:
URL: https://github.com/apache/arrow-rs/issues/3999#issuecomment-1493428878

   > Even then I think this is still better than the current approach of using method names to define overloads, at least the compiler tries to help you
   
   I am not sure I agree that the compiler's help is better in this case, but that is a personal preference. I certainly don't oppose this proposal on this grounds, but I wanted to explore it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #3999: Improved Temporal and Decimal Arithmetic

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #3999:
URL: https://github.com/apache/arrow-rs/issues/3999#issuecomment-1493424056

   > The other thing I wanted to mention was that we already have some non trivial changes in flight
   
   100% agree, I intend to have these done in time for the next release at the end of the coming week.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #3999: Improved Temporal and Decimal Arithmetic

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #3999:
URL: https://github.com/apache/arrow-rs/issues/3999#issuecomment-1493974485

   > I'm not sure if such trait-based dispatch could bring performance penalty even it is slight
   
   Trait-based dispatch is zero-cost, and behaves the same as normal function dispatch. Trait-object dispatch, i.e. using `dyn` is not zero-cost, however, it is unavoidable if you don't know the types statically. TLDR it should be no different than currently, and for cases where the static types are known, it will be faster
   
   > special needs for computation
   
   Yes, we will need to keep an eye out for this. I'm optimistic we can get by with traits for the type of operation, i.e. Divide, Add, Substract, etc... with checked and unchecked member functions. Divide might need a third for divide by zero => NULL, but that is perfectly fine imo


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold closed issue #3999: Improved Temporal and Decimal Arithmetic

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold closed issue #3999: Improved Temporal and Decimal Arithmetic
URL: https://github.com/apache/arrow-rs/issues/3999


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] viirya commented on issue #3999: Improved Temporal and Decimal Arithmetic

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on issue #3999:
URL: https://github.com/apache/arrow-rs/issues/3999#issuecomment-1493454958

   I think for a trait approach, the advantage would be a cleaner implementation. It might be easier to maintain for long term. The concerns I have are basically: 1. performance. I'm not sure if such trait-based dispatch could bring performance penalty even it is slight. 2. special needs. Adding arcane method names is because there are special needs for computation (e.g. checked, opt, overflow-allowing, etc). In order to match such requirements, I think one trait is not necessary but it will lead to some special traits for special need. So eventually we might still have arcane trait names 😂 .


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on issue #3999: Improved Temporal and Decimal Arithmetic

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #3999:
URL: https://github.com/apache/arrow-rs/issues/3999#issuecomment-1493424522

   > What are your thoughts on these error messages?
   
   > https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=e5e12eacb2b9041f988afe2012637a74
   
   
   ```
   = note: expected reference `&PrimitiveArray<DurationType>`
                 found reference `&PrimitiveArray<TimestampNanosecondType>`
   note: associated function defined here
     --> src/main.rs:5:8
      |
   5  |     fn add(&self, rhs: &Rhs) -> Result<Self::Output, String>;
   ```
   ✅  This one seems good
   
   
   > https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=6550efd10b3bf3ec7e7806de71fc6d0e
   
   ```
   = help: the following other types implement trait `Add<Rhs>`:
                <PrimitiveArray<TimestampNanosecondType> as Add<PrimitiveArray<DurationType>>>
                <PrimitiveArray<TimestampNanosecondType> as Add<PrimitiveArray<IntervalType>>>
   ```
   
   This one is also ok given there are only two options and they are probably what the user wants. However, I think in reality there would be so many impls that the ones I cared about (for adding timestamps)  would not make it into the actual message (they would be hidden behind `...`) and that is 👎 as  then I would have to go spelunking into the arrow type system to figure out what was going on.
   
   Maybe this could be handled with sufficiently good and detailed examples and messages. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #3999: Improved Temporal and Decimal Arithmetic

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #3999:
URL: https://github.com/apache/arrow-rs/issues/3999#issuecomment-1493062505

   @alamb @viirya I would be interested to hear your thoughts on this, I _think_ trait-based dispatch would be a much cleaner way to handle method overloading than our current approach of increasingly arcane method names, but I'm not sure if I've just been marinating in Rust's generics for too long :sweat_smile: 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on issue #3999: Improved Temporal and Decimal Arithmetic

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #3999:
URL: https://github.com/apache/arrow-rs/issues/3999#issuecomment-1493436656

   > Perhaps you might articulate what your personal preference would be for?
   
   I guess I would prefer a runtime error with a clear error message that "timestamp + timestamp is not supported"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on issue #3999: Improved Temporal and Decimal Arithmetic

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #3999:
URL: https://github.com/apache/arrow-rs/issues/3999#issuecomment-1493410568

   Having a kernel like `add(array: &dyn Array, scalar: &Scalar)` is certainly  compelling
   
   I agree traits would likely be nicer than arcane names.
   
   One thing I think would be a step back was if the traits were complicated enough to express things like `can't add PrimitiveType to TimestampType` -- because I think that would generate very hard to understand compiler error messages. 
   
   
   
   # why use a trait for `Scalar` rather than an `enum`? 
   
   The seems the main usescases for trait are:
   1. Users can supply their own implementations
   2. Can let the compiler generate specialized code. 
   
   
   Maybe it would more naturally extend to nested types 🤔  but I am not sure how one would add a `StructArray` ...
   
   ```rust
   enum Scalar {
     Int8(..)
   ...}
   ```
   
   # Do we plan to remove all typed kernels?
    I wonder if you are thinking about deprecating all the strongly typed kernels in favor of the dyn kernels?
   
   I seems like the dyn kernels would be easier to use for general purpose computing, though would increase binary code size (e.g. anything linking with arrow-rs would get all the arithmetic kernels even if it only used UInt64 for example)
   
   🤔  


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org