Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/16 17:56:09 UTC

[GitHub] [arrow] jorgecarleitao edited a comment on pull request #7967: ARROW-9751: [Rust] [DataFusion] Allow UDFs to accept multiple data types per argument

jorgecarleitao edited a comment on pull request #7967:
URL: https://github.com/apache/arrow/pull/7967#issuecomment-674555985


   > I reviewed the logic in this PR carefully and I think overall it is quite good. Nice work @jorgecarleitao. The only things I personally think are needed before it would be ready to merge are the following tests:
   > 
   >     1. End-to-end test (in sql.rs) using a UDF with multiple possible type signatures
   > 
   >     2. Unit tests for the coercion logic
   > 
   > 
   > All the rest of the feedback in this PR is just suggestions.
   > 
   > Along with the tests, it might be cool to write up some docs / comments that show how to write a UDF that actually accepts multiple types -- the only [test I see now](https://github.com/apache/arrow/pull/7967/files#diff-8273e76b6910baa123f3a25a967af3b5L1237) has a single set of argument types.
   > 
   > In this design, the UDFs are effectively "polymorphic" in the sense that they can accept multiple different argument types and will have to dispatch at runtime.
   > 
   > Another potential design for UDFs is to give each UDF an alias that could be duplicated, plus a single argument type (e.g. `sqrt --> sqrt_32(f32)` and `sqrt --> sqrt_64(f64)`). Then an optimizer / coercion pass would have the logic to resolve the `sqrt` alias to `sqrt_32` or `sqrt_64` depending on the input argument types.
   > 
   > This approach might be marginally faster, as the input types would be resolved once at planning time rather than at runtime. However, given that DataFusion has a vectorized executor (i.e. the types are handled once per RecordBatch), the overhead of runtime dispatch will likely not be noticeable.
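
   To make the alias-based design above concrete, here is a minimal sketch of what such a planning-time resolution pass could look like. This is purely illustrative: `resolve_alias`, `sqrt_32` and `sqrt_64` are hypothetical names taken from the example in the quote, not actual DataFusion APIs.

   ```rust
   use arrow::datatypes::DataType;

   /// Hypothetical planning-time pass: resolve a generic alias such as `sqrt`
   /// to a concrete, monomorphic UDF based on the input argument type, so that
   /// no type dispatch remains at runtime.
   fn resolve_alias(name: &str, arg_type: &DataType) -> Option<&'static str> {
       match (name, arg_type) {
           ("sqrt", DataType::Float32) => Some("sqrt_32"),
           ("sqrt", DataType::Float64) => Some("sqrt_64"),
           _ => None,
       }
   }
   ```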
   
   Thank you again @alamb for taking the time to review this. I agree with all you said.
   
   My other two PRs effectively add support for polymorphic functions (including the return type). The reason is that we are already doing this for our own UDFs, via the function `data_type()/get_type()`, at both the physical and the logical level. It is intrinsic to our execution model, which requires downcasting and builders inside the function. Since we already require the developer to go through that pain, we may as well offer them the full flexibility that comes with it. I agree that there is a small overhead, but IMO it is small compared to execution, and we can always use some form of cached attribute if we start seeing performance issues in the planning phase.
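
   For illustration, here is a minimal sketch of what such a polymorphic UDF body can look like, with the downcast and the per-batch dispatch spelled out. The name `my_sqrt` is made up, and I use a plain `String` error instead of DataFusion's actual error type to keep the sketch self-contained:

   ```rust
   use std::sync::Arc;

   use arrow::array::{Array, ArrayRef, Float32Array, Float64Array};
   use arrow::datatypes::DataType;

   /// Hypothetical UDF body: accepts Float32 or Float64 input and always
   /// produces Float64. The match on the runtime type happens once per
   /// RecordBatch, not once per row, which is why the dispatch overhead is
   /// small in a vectorized executor.
   fn my_sqrt(args: &[ArrayRef]) -> Result<ArrayRef, String> {
       match args[0].data_type() {
           DataType::Float32 => {
               // Downcast the dynamically typed array to its concrete type.
               let arr = args[0]
                   .as_any()
                   .downcast_ref::<Float32Array>()
                   .expect("type was just checked");
               // Nulls propagate through the Option in the iterator.
               let out: Float64Array =
                   arr.iter().map(|v| v.map(|x| (x as f64).sqrt())).collect();
               Ok(Arc::new(out))
           }
           DataType::Float64 => {
               let arr = args[0]
                   .as_any()
                   .downcast_ref::<Float64Array>()
                   .expect("type was just checked");
               let out: Float64Array =
                   arr.iter().map(|v| v.map(f64::sqrt)).collect();
               Ok(Arc::new(out))
           }
           other => Err(format!("my_sqrt does not support {:?}", other)),
       }
   }
   ```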
   
   As an example of the impressive flexibility we get: in a [pet project of mine](https://github.com/jorgecarleitao/datafusion-python), I was able to run Python lambdas inside a DataFusion UDF and convert the result back to NumPy arrays, which IMO is mind-blowing.
   
   Changes from last time:
   
   * incorporated almost all your suggestions
   * added end-to-end tests
   * added unit tests for both internal functions of the type coercion logic
   
   I think that the logic of `get_supertype` is not entirely correct at the moment (e.g. it treats Utf8 as convertible to every type), but we have an issue tracking that.
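
   To illustrate the kind of rule at stake, here is a hand-written toy sketch of a supertype function. This is not DataFusion's actual `get_supertype`; it is an assumption-laden miniature covering only two numeric cases:

   ```rust
   use arrow::datatypes::DataType;

   /// Toy supertype rule (NOT DataFusion's actual `get_supertype`): return a
   /// type that both inputs can be losslessly coerced to, if one exists.
   fn supertype(a: &DataType, b: &DataType) -> Option<DataType> {
       use arrow::datatypes::DataType::*;
       match (a, b) {
           _ if a == b => Some(a.clone()),
           (Float32, Float64) | (Float64, Float32) => Some(Float64),
           (Int32, Int64) | (Int64, Int32) => Some(Int64),
           // Deliberately no blanket "Utf8 coerces to anything" rule; that
           // over-permissive case is exactly what the tracking issue covers.
           _ => None,
       }
   }
   ```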
   
   I currently struggle to know exactly where to place the documentation for this. IMO it is not something for the API documentation, but rather for a user guide. I think we need to take some time to create placeholders in our `docs/` for these. Pinging @andygrove for ideas as well.

