You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/03/29 20:50:08 UTC
[GitHub] [arrow-datafusion] alamb opened a new issue, #5781: User defined window functions
alamb opened a new issue, #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781
### Is your feature request related to a problem or challenge?
When implementing our InfuxQL frontend for DataFusion, we found certain functions we would like to express as window functions (like a specialized interpolation function for example)
### Describe the solution you'd like
_No response_
### Describe alternatives you've considered
_No response_
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on issue #5781: User defined window functions
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1579085665
Unless someone beats me to it I plan to start working on a proposed design for this feature in the next few days
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb closed issue #5781: User defined window functions
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #5781: User defined window functions
URL: https://github.com/apache/arrow-datafusion/issues/5781
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on issue #5781: User defined window functions
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1581428228
Here is a proposed design.
# LogicalPlan / Expr
[The current (`Expr`) representation of window functions](https://github.com/apache/arrow-datafusion/blob/1d3860dc813725b5a987aae4c3a8f4a7b2bfdb2d/datafusion/expr/src/expr.rs#L157):
```rust
#[derive(Clone, PartialEq, Eq, Hash)]
pub enum Expr {
...
/// Represents the call of a window function with arguments.
WindowFunction(WindowFunction),
}
```
Which is [defined](https://github.com/apache/arrow-datafusion/blob/1d3860dc813725b5a987aae4c3a8f4a7b2bfdb2d/datafusion/expr/src/expr.rs#L448-L458) like:
```rust
/// Window function
#[derive(Clone, PartialEq, Eq, Hash)]
pub struct WindowFunction {
/// Name of the function
pub fun: window_function::WindowFunction,
/// List of expressions to feed to the functions as arguments
pub args: Vec<Expr>,
/// List of partition by expressions
pub partition_by: Vec<Expr>,
/// List of order by expressions
pub order_by: Vec<Expr>,
/// Window frame
pub window_frame: window_frame::WindowFrame,
}
```
Note that [`window_function::WindowFunction`](https://github.com/apache/arrow-datafusion/blob/1d3860dc813725b5a987aae4c3a8f4a7b2bfdb2d/datafusion/expr/src/window_function.rs#L35C26-L41) already handles user defined functions (`AggregateUDF`), so I propose to add an addutionl user defined window function variant there:
```rust
/// WindowFunction (in `window_function`):
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum WindowFunction {
/// A built in aggregate function that leverages an aggregate function
AggregateFunction(AggregateFunction),
/// A a built-in window function
BuiltInWindowFunction(BuiltInWindowFunction),
/// A user defined aggregate function
AggregateUDF(Arc<AggregateUDF>),
/// A user defined aggregate function <---- This is NEW
WindowUDF(Arc<WindowUDF>),
}
```
# WindowUDF
`WindowUDF` will be a structure that looks very similar to `AggregateUDF`:
https://github.com/apache/arrow-datafusion/blob/1d3860dc813725b5a987aae4c3a8f4a7b2bfdb2d/datafusion/expr/src/udaf.rs#L30-L40
And similarly to the way that an `AggregateUDF` instantiates an [`Accumulator`](https://github.com/apache/arrow-datafusion/blob/1d3860dc813725b5a987aae4c3a8f4a7b2bfdb2d/datafusion/expr/src/accumulator.rs#L33) (via [`AccumulatorFunctionImplementation`](https://github.com/apache/arrow-datafusion/blob/1d3860dc813725b5a987aae4c3a8f4a7b2bfdb2d/datafusion/expr/src/function.rs#L49)) the WindowUDF will need to provide a [`PartitionEvaluator`](https://github.com/apache/arrow-datafusion/blob/1d3860dc813725b5a987aae4c3a8f4a7b2bfdb2d/datafusion/physical-expr/src/window/partition_evaluator.rs#L29-L107) instance for each partition of the data.
```rust
pub trait PartitionEvaluator: Debug + Send {
```
I am a little unclear on certain parts of the `PartitionEvaluator` AP (specifically, the stateful / stateless partition evaluation) and if we can make it straight forward to implement this for datafusion
Looking at how this code is used in the `physical_plan` module, what is needed is something that implements the [`BuiltInWindowFunctionExpr`](https://github.com/apache/arrow-datafusion/blob/1d3860dc813725b5a987aae4c3a8f4a7b2bfdb2d/datafusion/physical-expr/src/window/built_in_window_function_expr.rs#L31). I am not sure yet if it makes sense to have `WindowUDF` do so directly of if another structure would be better
Here are the next steps I plan:
So steps:
- [ ] Add some additional docstrings to `PartitionEvaluator` which will both improve the code, and allow me to sort out some of my questions.
- [ ] Propose renaming `BuiltInWindowFunctionExpr` to `WindowFunctionExpr` to reflect they will be used for more than BuiltIn
- [ ] Propose renaming `BuiltinWindowState` to `WindowState` (in the same theory that it is not related to built ins)
- [ ] Work on a POC / technical spike showing how this might work, including an end to end example
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] stuartcarnie commented on issue #5781: User defined window functions
Posted by "stuartcarnie (via GitHub)" <gi...@apache.org>.
stuartcarnie commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1581627637
Great stuff, @alamb! I did have similar ideas in my own test branch, but I wasn't sure how to replicate `BuiltInWindowFunctionExpr` and `BuiltinWindowState`. Renaming them makes a lot of sense 👍🏻
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on issue #5781: User defined window functions
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1523897374
> Yes, we will be using UDWF too. Is there a design doc we can read and comment on?
There is no document that I know of -- it is probably time to start one
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] doki23 commented on issue #5781: User defined window functions
Posted by "doki23 (via GitHub)" <gi...@apache.org>.
doki23 commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1522606061
How we distinguish UDWF from UDAF since window function already supports AggregateUDF, I mean UDWF is a super set of udaf in window function.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on issue #5781: User defined window functions
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1583105449
While working on https://github.com/apache/arrow-datafusion/pull/6592 a key design question about `WindowUDF` has arisen. Specifically:
# Use existing API / Traits
## What this would mean:
This would mean exposing (at least) [`BuiltInWindowFunctionExpr`](https://github.com/apache/arrow-datafusion/blob/d689daf83f387d7cf1895b2fad6b347d216e47fc/datafusion/physical-expr/src/window/built_in_window_function_expr.rs#L31) and [`PartitionEvaluator`](https://github.com/apache/arrow-datafusion/blob/d689daf83f387d7cf1895b2fad6b347d216e47fc/datafusion/physical-expr/src/window/partition_evaluator.rs#L29) directly for people to implement a `WindowUDF`
This would mean making those traits `pub` and part of a WindowUDF and
## Pros
1. This is the same pattern used by [`AggregateUDF`](https://github.com/apache/arrow-datafusion/blob/d689daf83f387d7cf1895b2fad6b347d216e47fc/datafusion/expr/src/udaf.rs#L30-L40) (TODO link) that exposes the [`Accumulator`](https://github.com/apache/arrow-datafusion/blob/d689daf83f387d7cf1895b2fad6b347d216e47fc/datafusion/expr/src/accumulator.rs#L33) trait directly to implementors
1. Allow user defined window functions to access all the features and performance as built in window functions (both now and in the future)
## Cons
`BuiltInWindowFunctionExpr` and `PartitionEvaluator` are are fairly complicated (as they support many features) which could make implementing user defined window functions harder
Also, `BuiltInWindowFunctionExpr` and `PartitionEvaluator` would likely need some changes such as:
1. renamed so they weren't called "BuiltIn*"
2. Moving the trait definitions into `datafusion_expr`
3. Update the state management to be extendable (e.g. the enum for `BuiltinWindowState` would have to be updated to to allow user defined state somehow)
# Make new APIs and traits
## What this would mean:
In this case, we would make new traits that are subsets of what is in
`BuiltInWindowFunctionExpr` and `PartitionEvaluator` that users would
implement. Internally we would implement `BuiltInWindowFunctionExpr`
and `PartitionEvaluator` in terms of these new traits.
# Pros
1. We can tailor the API to precisely the needs of user defined window function writers
2. Would allow changing the internal APIs without having to change the WindowUDF API
# Cons
A new API would mean introducing another api layer that needs to be tested and kept up to date.
The new API also might not expose all the functionality of the built in window functions if such functionality was added at a later date
## Discussion
I am going to try and bash out a technical proof of concept of the first approach (exposing the existing APIs) and we can see how it would look. On the balance that is my preferred approach due to the power of the API and the similarities with `AggregateUDF`s
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on issue #5781: User defined window functions
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1523900027
> How we distinguish UDWF from UDAF since window function already supports AggregateUDF, I mean UDWF is a super set of udaf in window function.
@doki23 I am not sure
The current code has this:
https://docs.rs/datafusion-expr/23.0.0/datafusion_expr/window_function/enum.WindowFunction.html
I never really understood why DataFusion makes a distinction between built in aggregate functions and user defined aggregate functions (it would be really nice if all aggregate functions had the same interface)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] ozankabak commented on issue #5781: User defined window functions
Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1581793355
Looks reasonable to me, thanks @alamb
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] ozankabak commented on issue #5781: User defined window functions
Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1522550033
Yes, we will be using UDWF too. Is there a design doc we can read and comment on?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on issue #5781: User defined window functions
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1521519672
FYI @mustafasrepo and @ozankabak I think @stuartcarnie is contemplating what implementing user defined window functions might entail. I am not sure if you have any thoughts on this matter you would like to share
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on issue #5781: User defined window functions
Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1584962514
Here is a proposed API https://github.com/apache/arrow-datafusion/pull/6617 showing how a User Defined Window function would work. Any feedback on the approach would be most appreciated.
While waiting for feedback on https://github.com/apache/arrow-datafusion/pull/6617 I plan to try and simplify some of the window function implementation as some small PRs
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org