You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/03/29 20:50:08 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue, #5781: User defined window functions

alamb opened a new issue, #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781

   ### Is your feature request related to a problem or challenge?
   
   When implementing our InfuxQL frontend for DataFusion, we found certain functions we would like to express as window functions (like a specialized interpolation function for example)
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5781: User defined window functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1579085665

   Unless someone beats me to it I plan to start working on a proposed design for this feature in the next few days


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb closed issue #5781: User defined window functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #5781: User defined window functions
URL: https://github.com/apache/arrow-datafusion/issues/5781


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5781: User defined window functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1581428228

   Here is a proposed design.
   
   # LogicalPlan / Expr
   
   [The current (`Expr`) representation of window functions](https://github.com/apache/arrow-datafusion/blob/1d3860dc813725b5a987aae4c3a8f4a7b2bfdb2d/datafusion/expr/src/expr.rs#L157):
   
   ```rust
   #[derive(Clone, PartialEq, Eq, Hash)]
   pub enum Expr {
   ...
       /// Represents the call of a window function with arguments.
       WindowFunction(WindowFunction),
   }
   ```
   
   Which is [defined](https://github.com/apache/arrow-datafusion/blob/1d3860dc813725b5a987aae4c3a8f4a7b2bfdb2d/datafusion/expr/src/expr.rs#L448-L458) like:
   
   ```rust
   /// Window function
   #[derive(Clone, PartialEq, Eq, Hash)]
   pub struct WindowFunction {
       /// Name of the function
       pub fun: window_function::WindowFunction,
       /// List of expressions to feed to the functions as arguments
       pub args: Vec<Expr>,
       /// List of partition by expressions
       pub partition_by: Vec<Expr>,
       /// List of order by expressions
       pub order_by: Vec<Expr>,
       /// Window frame
       pub window_frame: window_frame::WindowFrame,
   }
   ```
   
   Note that [`window_function::WindowFunction`](https://github.com/apache/arrow-datafusion/blob/1d3860dc813725b5a987aae4c3a8f4a7b2bfdb2d/datafusion/expr/src/window_function.rs#L35C26-L41) already handles user defined functions (`AggregateUDF`), so I propose to add an addutionl user defined window function variant there:
   
   ```rust
   /// WindowFunction (in `window_function`):
   #[derive(Debug, Clone, PartialEq, Eq, Hash)]
   pub enum WindowFunction {
       /// A built in aggregate function that leverages an aggregate function
       AggregateFunction(AggregateFunction),
       /// A a built-in window function
       BuiltInWindowFunction(BuiltInWindowFunction),
       /// A user defined aggregate function
       AggregateUDF(Arc<AggregateUDF>),
       /// A user defined aggregate function  <---- This is NEW
       WindowUDF(Arc<WindowUDF>),
   }
   ```
   
   # WindowUDF
   
   
   `WindowUDF` will be a structure that looks very similar to `AggregateUDF`:
   
   https://github.com/apache/arrow-datafusion/blob/1d3860dc813725b5a987aae4c3a8f4a7b2bfdb2d/datafusion/expr/src/udaf.rs#L30-L40
   
   And similarly to the way that an `AggregateUDF` instantiates an [`Accumulator`](https://github.com/apache/arrow-datafusion/blob/1d3860dc813725b5a987aae4c3a8f4a7b2bfdb2d/datafusion/expr/src/accumulator.rs#L33) (via [`AccumulatorFunctionImplementation`](https://github.com/apache/arrow-datafusion/blob/1d3860dc813725b5a987aae4c3a8f4a7b2bfdb2d/datafusion/expr/src/function.rs#L49)) the WindowUDF will need to provide a [`PartitionEvaluator`](https://github.com/apache/arrow-datafusion/blob/1d3860dc813725b5a987aae4c3a8f4a7b2bfdb2d/datafusion/physical-expr/src/window/partition_evaluator.rs#L29-L107) instance for each partition of the data.
   
   ```rust
   pub trait PartitionEvaluator: Debug + Send {
   ```
   
   I am a little unclear on certain parts of the `PartitionEvaluator` AP (specifically, the stateful / stateless partition evaluation) and if we can make it straight forward to implement this for datafusion
   
   Looking at how this code is used in the `physical_plan` module, what is needed is something that implements the [`BuiltInWindowFunctionExpr`](https://github.com/apache/arrow-datafusion/blob/1d3860dc813725b5a987aae4c3a8f4a7b2bfdb2d/datafusion/physical-expr/src/window/built_in_window_function_expr.rs#L31). I am not sure yet if it makes sense to have `WindowUDF` do so directly of if another structure would be better
   
   Here are the next steps I plan:
   
   So steps:
   - [ ] Add some additional docstrings to `PartitionEvaluator` which will both improve the code, and allow me to sort out some of my questions.
   - [ ] Propose renaming `BuiltInWindowFunctionExpr` to `WindowFunctionExpr` to reflect they will be used for more than BuiltIn
   - [ ] Propose renaming `BuiltinWindowState` to `WindowState` (in the same theory that it is not related to built ins)
   - [ ] Work on a POC / technical spike showing how this might work, including an end to end example
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] stuartcarnie commented on issue #5781: User defined window functions

Posted by "stuartcarnie (via GitHub)" <gi...@apache.org>.
stuartcarnie commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1581627637

   Great stuff, @alamb! I did have similar ideas in my own test branch, but I wasn't sure how to replicate `BuiltInWindowFunctionExpr` and `BuiltinWindowState`. Renaming them makes a lot of sense 👍🏻 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5781: User defined window functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1523897374

   >  Yes, we will be using UDWF too. Is there a design doc we can read and comment on?
   
   There is no document that I know of -- it is probably time to start one


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] doki23 commented on issue #5781: User defined window functions

Posted by "doki23 (via GitHub)" <gi...@apache.org>.
doki23 commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1522606061

   How we distinguish UDWF from UDAF since window function already supports AggregateUDF, I mean UDWF is a super set of udaf in window function.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5781: User defined window functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1583105449

   While working on https://github.com/apache/arrow-datafusion/pull/6592 a key design question about `WindowUDF` has arisen. Specifically:
   
   # Use existing API / Traits
   
   ## What this would mean:
   
   This would mean exposing (at least)  [`BuiltInWindowFunctionExpr`](https://github.com/apache/arrow-datafusion/blob/d689daf83f387d7cf1895b2fad6b347d216e47fc/datafusion/physical-expr/src/window/built_in_window_function_expr.rs#L31) and [`PartitionEvaluator`](https://github.com/apache/arrow-datafusion/blob/d689daf83f387d7cf1895b2fad6b347d216e47fc/datafusion/physical-expr/src/window/partition_evaluator.rs#L29) directly for people to implement a `WindowUDF` 
   
   This would mean making those traits `pub` and part of a WindowUDF and 
   
   ## Pros
   
   1. This is the same pattern used by [`AggregateUDF`](https://github.com/apache/arrow-datafusion/blob/d689daf83f387d7cf1895b2fad6b347d216e47fc/datafusion/expr/src/udaf.rs#L30-L40) (TODO link) that exposes the [`Accumulator`](https://github.com/apache/arrow-datafusion/blob/d689daf83f387d7cf1895b2fad6b347d216e47fc/datafusion/expr/src/accumulator.rs#L33) trait directly to implementors
   1. Allow user defined window functions to access all the features and performance as built in window functions (both now and in the future)
   
   ## Cons
   
   `BuiltInWindowFunctionExpr` and `PartitionEvaluator` are are fairly complicated (as they support many features) which could make implementing user defined window functions harder
   
   Also, `BuiltInWindowFunctionExpr` and `PartitionEvaluator` would likely need some changes such as:
   1. renamed so they weren't called "BuiltIn*"
   2. Moving the trait definitions into `datafusion_expr`
   3. Update the state management to be extendable (e.g. the enum for `BuiltinWindowState` would have to be updated to to allow user defined state somehow)
   
   
   
   # Make new APIs and traits
   
   ## What this would mean:
   
   In this case, we would make new traits that are subsets of what is in
   `BuiltInWindowFunctionExpr` and `PartitionEvaluator` that users would
   implement. Internally we would implement `BuiltInWindowFunctionExpr`
   and `PartitionEvaluator` in terms of these new traits.
   
   # Pros
   
   1. We can tailor the API to precisely the needs of user defined window function writers
   2. Would allow changing the internal APIs without having to change the WindowUDF API
   
   # Cons
   A new API would mean introducing another api layer that needs to be tested and kept up to date.
   
   The new API also might not expose all the functionality of the built in window functions if such functionality was added at a later date
   
   
   
   
   ## Discussion
   
   I am going to try and bash out a technical proof of concept of the first approach (exposing the existing APIs) and we can see how it would look. On the balance that is my preferred approach due to the power of the API and the similarities with `AggregateUDF`s
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5781: User defined window functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1523900027

   > How we distinguish UDWF from UDAF since window function already supports AggregateUDF, I mean UDWF is a super set of udaf in window function.
   
   @doki23 I am not sure
   
   The current code has this: 
   https://docs.rs/datafusion-expr/23.0.0/datafusion_expr/window_function/enum.WindowFunction.html
   
   I never really understood why DataFusion makes a distinction between built in aggregate functions and user defined aggregate functions (it would be really nice if all aggregate functions had the same interface)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ozankabak commented on issue #5781: User defined window functions

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1581793355

   Looks reasonable to me, thanks @alamb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ozankabak commented on issue #5781: User defined window functions

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.
ozankabak commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1522550033

   Yes, we will be using UDWF too. Is there a design doc we can read and comment on?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5781: User defined window functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1521519672

   FYI @mustafasrepo  and @ozankabak  I think @stuartcarnie  is contemplating what implementing user defined window functions might entail. I am not sure if you have any thoughts on this matter you would like to share


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on issue #5781: User defined window functions

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #5781:
URL: https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1584962514

   Here is a proposed API https://github.com/apache/arrow-datafusion/pull/6617 showing how a User Defined Window function would work. Any feedback on the approach would be most appreciated.
   
   While waiting for feedback on https://github.com/apache/arrow-datafusion/pull/6617 I plan to try and simplify some of the window function implementation as some small PRs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org