You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/06/22 17:09:42 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #6746: RFC: Make it easier to call window functions via expression API (and add example)

alamb opened a new pull request, #6746:
URL: https://github.com/apache/arrow-datafusion/pull/6746

   # Which issue does this PR close?
   Related to https://github.com/apache/arrow-datafusion/issues/5781
   
   # Rationale for this change
   @mustafasrepo  suggested adding an example for `DataFrame::window` here: https://github.com/apache/arrow-datafusion/pull/6703#discussion_r1238403292
   
   When I tried to do so it turns out that creating an `Expr::Window` is quite challenging, so I took a step back and tried to clean up the API a little
   
   # What changes are included in this PR?
   
   1. Change `WindowExpr` to be a builder style interface and thus making it easier to construct `Expr::WindowFunction` variants. This is modeled after the `CaseBuilder`
   2. Add the `lead`, `lag`, etc. in `expr_fns` to follow the model of the other types of functions
   
   # Are these changes tested?
   Yes
   
   TOOD: I need to write tests for all the code in `expr_fns` which I will add to https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/dataframe/dataframe_functions.rs if people like this basic approach
   
   # Are there any user-facing changes?
   yes, the APIs to create window functions are improved


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] RFC: Make it easier to call window functions via expression API (and add example) [datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6746:
URL: https://github.com/apache/datafusion/pull/6746#issuecomment-2096468389

   > I think it might be easier to use with something like
   
   That makes sense to me @timsaucer  -- thank you for the feedback. 
   
   I don't think I will have time to work on this PR / issue for a while, but I tried to give it some visibility in https://github.com/apache/datafusion/issues/10395 and hopefully we can find someone else who is interested in helping make it work


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscribe@datafusion.apache.org
For additional commands, e-mail: github-help@datafusion.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6746: RFC: Make it easier to call window functions via expression API (and add example)

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6746:
URL: https://github.com/apache/arrow-datafusion/pull/6746#discussion_r1238815288


##########
datafusion/core/src/dataframe.rs:
##########
@@ -219,6 +219,23 @@ impl DataFrame {
     }
 
     /// Apply one or more window functions ([`Expr::WindowFunction`]) to extend the schema
+    ///
+    /// ```
+    /// # use datafusion::prelude::*;
+    /// # use datafusion::error::Result;
+    /// # #[tokio::main]
+    /// # async fn main() -> Result<()> {
+    /// let ctx = SessionContext::new();
+    /// let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
+    ///
+    /// // The following is the equivalent of "SELECT FIRST_VALUE(b) OVER(PARTITION BY a)"
+    /// let first_value = first_value(col("b"))

Review Comment:
   The rest of the PR is motivated by trying to write this example. It turned out to be quite a pain to create the right `Expr`.



##########
datafusion/expr/src/expr.rs:
##########
@@ -459,22 +482,40 @@ pub struct WindowFunction {
 }
 
 impl WindowFunction {
-    /// Create a new Window expression
-    pub fn new(
-        fun: window_function::WindowFunction,
-        args: Vec<Expr>,
-        partition_by: Vec<Expr>,
-        order_by: Vec<Expr>,
-        window_frame: window_frame::WindowFrame,
-    ) -> Self {
+    /// Create a new Window expression with the specified argument an
+    /// empty `OVER` clause
+    pub fn new(fun: impl Into<window_function::WindowFunction>, args: Vec<Expr>) -> Self {

Review Comment:
   This is the major change -- so that WindowFunction acts like a `Builder` which can be used to set `partition by`, `order by` and `window_frame` with a more self documenting style



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] RFC: Make it easier to call window functions via expression API (and add example) [datafusion]

Posted by "timsaucer (via GitHub)" <gi...@apache.org>.
timsaucer commented on PR #6746:
URL: https://github.com/apache/datafusion/pull/6746#issuecomment-2090457500

   As a new user, I like this change very much. I found creating lead and lag functions to be difficult to use. Suppose I wanted to call lead but provide the other two optional arguments - shift offset and default value. Would the best way to do that with this proposal to create the `WindowFunction` directly?
   
   ```
   let single_lead = lead(col("mycol"));
   let two_lead_with_default = expr::WindowFunction::new(BuiltInWindowFunction::Lead, vec![col("mycol"), lit(2), lit(mydefault)]);
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscribe@datafusion.apache.org
For additional commands, e-mail: github-help@datafusion.apache.org


Re: [PR] RFC: Make it easier to call window functions via expression API (and add example) [datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6746:
URL: https://github.com/apache/datafusion/pull/6746#issuecomment-2092671954

   > As a new user, I like this change very much. I found creating lead and lag functions to be difficult to use. Suppose I wanted to call lead but provide the other two optional arguments - shift offset and default value. Would the best way to do that with this proposal to create the `WindowFunction` directly?
   
   I think it would look something like this
   
   ```rust
   let single_lead = lead(col("mycol"))
     .
   let two_lead_with_default = lead.call(col("mycol"))
     .offset(lit(2))
     .partition_by(lit(mydefault)]))
     .build()
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscribe@datafusion.apache.org
For additional commands, e-mail: github-help@datafusion.apache.org


Re: [PR] RFC: Make it easier to call window functions via expression API (and add example) [datafusion]

Posted by "timsaucer (via GitHub)" <gi...@apache.org>.
timsaucer commented on PR #6746:
URL: https://github.com/apache/datafusion/pull/6746#issuecomment-2092866964

   I'm looking at the arguments we end up passing. I think `cume_dist` takes no arguments. Lead and lag take anywhere between 1 and 3. `nth_value` takes two. I think it might be easier to use with something like
   
   ```
   pub fn lead(arg: Expr, offset: Option<i64>, default: Option<ScalarValue>) -> expr::WindowFunction {
       let offset = lit(ScalarValue::Int64(offset));
       let default_value = match default {
           Some(v) => lit(v),
           None => lit(ScalarValue::Null),
       };
       expr::WindowFunction::new(BuiltInWindowFunction::Lead, vec![arg, offset, default_value])
   }
   ```
   
   and for nth value it's easier
   
   ```
   pub fn nth_value(arg: Expr, n: i64) -> expr::WindowFunction {
       expr::WindowFunction::new(BuiltInWindowFunction::NthValue, vec![arg, lit(n)])
   }
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscribe@datafusion.apache.org
For additional commands, e-mail: github-help@datafusion.apache.org


Re: [PR] RFC: Make it easier to call window functions via expression API (and add example) [datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6746:
URL: https://github.com/apache/datafusion/pull/6746#issuecomment-2096469207

   (or of course if you have the time it would be an amazing contribution 🙏 )


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscribe@datafusion.apache.org
For additional commands, e-mail: github-help@datafusion.apache.org


Re: [PR] RFC: Make it easier to call window functions via expression API (and add example) [datafusion]

Posted by "timsaucer (via GitHub)" <gi...@apache.org>.
timsaucer commented on PR #6746:
URL: https://github.com/apache/datafusion/pull/6746#issuecomment-2098821089

   I'm willing to work on this, but I'd like to wrap up the examples I have going into datafusion-python first. I think those will have more impact. I only have an hour or so a day to work on this, so I can't promise how fast I'll get to it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscribe@datafusion.apache.org
For additional commands, e-mail: github-help@datafusion.apache.org