
Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2023/06/01 16:57:04 UTC

[GitHub] [arrow-rs] tustvold commented on issue #1047: Add `Scalar` / `Datum` support to compute kernels

tustvold commented on issue #1047:
URL: https://github.com/apache/arrow-rs/issues/1047#issuecomment-1572425176

   So I've been playing around with this, and the major challenge is avoiding a huge amount of API churn / boilerplate.
   
   Take the signature
   
   `add_dyn(a: &dyn Array, b: &dyn Array) -> Result<ArrayRef>`
   
   It's not clear how to convert this to a Datum-based model. One option would be
   
   ```
   add_dyn(a: Datum<'_>, b: Datum<'_>) -> Result<ArrayRef>
   ```
   
   Where `Datum` is something like
   
   ```
   enum Datum<'a> {
       Array(&'a dyn Array),
       Scalar(&'a dyn Scalar)
   }
   ```
   
   But this has a couple of issues
   
   * Callsites now have to explicitly wrap their arguments in `Datum`
   * There is no way to return a scalar
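   
   Both issues can be seen in a minimal sketch. The `Array` and `Scalar` traits, the `Int64Array` / `Int64Scalar` types and the `Vec<i64>` return type below are toy stand-ins invented for illustration, not the arrow-rs definitions:
   
   ```rust
   // Toy stand-ins for the arrow-rs traits (hypothetical, illustration only).
   trait Array { fn values(&self) -> &[i64]; }
   trait Scalar { fn value(&self) -> i64; }
   
   struct Int64Array(Vec<i64>);
   struct Int64Scalar(i64);
   
   impl Array for Int64Array { fn values(&self) -> &[i64] { &self.0 } }
   impl Scalar for Int64Scalar { fn value(&self) -> i64 { self.0 } }
   
   enum Datum<'a> {
       Array(&'a dyn Array),
       Scalar(&'a dyn Scalar),
   }
   
   // The return type is always array-like, so scalar + scalar still has to
   // come back as a one-element array.
   fn add_dyn(a: Datum<'_>, b: Datum<'_>) -> Result<Vec<i64>, String> {
       match (a, b) {
           (Datum::Array(a), Datum::Array(b)) => {
               let (a, b) = (a.values(), b.values());
               if a.len() != b.len() {
                   return Err("length mismatch".to_string());
               }
               Ok(a.iter().zip(b).map(|(x, y)| x + y).collect())
           }
           // Addition is commutative, so both mixed cases share one arm.
           (Datum::Array(a), Datum::Scalar(s)) | (Datum::Scalar(s), Datum::Array(a)) => {
               let s = s.value();
               Ok(a.values().iter().map(|x| x + s).collect())
           }
           (Datum::Scalar(a), Datum::Scalar(b)) => Ok(vec![a.value() + b.value()]),
       }
   }
   ```
   
   A caller would have to write `add_dyn(Datum::Array(&arr), Datum::Scalar(&ten))` rather than `add_dyn(&arr, &ten)`, which is the wrapping cost from the first bullet.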
   
   Making `Datum` a trait doesn't help here either because, without specialization, the coherence rules prevent blanket implementations for both `T: Scalar` and `T: Array`.
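   
   A rough sketch of that conflict, again with hypothetical toy traits rather than the real arrow-rs ones. The second blanket impl is left commented out so the block compiles:
   
   ```rust
   // Toy stand-ins (hypothetical, illustration only).
   trait Array { fn len(&self) -> usize; }
   trait Scalar {}
   
   trait Datum { fn is_scalar(&self) -> bool; }
   
   // One blanket impl is fine on its own.
   impl<T: Array> Datum for T {
       fn is_scalar(&self) -> bool { false }
   }
   
   // Uncommenting this second blanket impl is rejected with
   // error[E0119] (conflicting implementations of trait `Datum`),
   // because nothing stops a type from implementing both `Array` and `Scalar`.
   //
   // impl<T: Scalar> Datum for T {
   //     fn is_scalar(&self) -> bool { true }
   // }
   
   struct Int64Array(Vec<i64>);
   impl Array for Int64Array { fn len(&self) -> usize { self.0.len() } }
   ```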
   
   Another option would be to make the methods generic, with `impl Into<Datum>`, but this also has downsides:
   
   * It runs into the same blanket impl issues as making `Datum` a trait
   * The generic kernels result in significant additional codegen
   
   Taking a step back, I had a potentially controversial thought: **why not just treat a single-element array as a scalar**?
   
   This would have some pretty compelling advantages:
   
   * No changes to type signatures necessary
   * Unary kernels like casting just work with no modification
   * Complete type coverage for no effort
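   
   As one possible reading of this idea, a kernel could keep its array-in / array-out signature and broadcast whenever one side has length 1. The `add` function over plain `i64` slices below is a hypothetical sketch, not an arrow-rs kernel:
   
   ```rust
   // Hypothetical kernel over plain slices; the signature is the same
   // whether or not one side is a "scalar" (a one-element array).
   fn add(a: &[i64], b: &[i64]) -> Result<Vec<i64>, String> {
       match (a.len(), b.len()) {
           // Equal lengths: ordinary element-wise addition.
           (n, m) if n == m => Ok(a.iter().zip(b).map(|(x, y)| x + y).collect()),
           // Left side is a one-element array: broadcast it over the right.
           (1, _) => Ok(b.iter().map(|y| a[0] + y).collect()),
           // Right side is a one-element array: broadcast it over the left.
           (_, 1) => Ok(a.iter().map(|x| x + b[0]).collect()),
           (n, m) => Err(format!("cannot add arrays of length {n} and {m}")),
       }
   }
   ```
   
   Under this sketch `add(&[1, 2, 3], &[10])` and `add(&[1, 2, 3], &[4, 5, 6])` go through the same entry point, and a scalar result is simply a one-element array.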
   
   The obvious downside is that the representation is not very memory efficient. I think the question boils down to what the purpose of the scalar representation is. Is it:
   
   1. To allow more efficient kernels where one side is known to be a scalar, e.g. scalar comparison
   2. Provide an efficient type-erased representation for row-oriented operations like grouping
   3. Provide efficient scalar operations
   
   My 2 cents is that 2. is a use-case better served by the row representation, and 3. is beyond the scope of a vectorized execution engine, and therefore 1. is the target for this feature. As such, I think this is a perfectly acceptable approach. The overheads of the slightly less efficient representation will be more than outweighed by the costs of the dynamic dispatch alone.
   
   What do people think?
   

