You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/06/07 20:51:47 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #6592: Add additional docstrings to Window function implementations

alamb opened a new pull request, #6592:
URL: https://github.com/apache/arrow-datafusion/pull/6592

   # Which issue does this PR close?
   
   Related to #5781 
   
   # Rationale for this change
   
   As I begin working out how to expose User Defined Window Functions, I want to 
   1. Make sure the documentation is up to date
   2. Make sure I understand the code sufficiently
   
   # What changes are included in this PR?
   1. Add a bunch of doc comments to explain various window related functionality better
   2. Make `PartitionEvaluator` public
   
   # Are these changes tested?
   Yes, with CI
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   3. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   # Are there any user-facing changes?
   better rustdocs, hopefully
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6592: Minor: Add additional docstrings to Window function implementations

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6592:
URL: https://github.com/apache/arrow-datafusion/pull/6592#discussion_r1224191465


##########
datafusion/physical-expr/src/window/built_in_window_function_expr.rs:
##########
@@ -56,19 +62,36 @@ pub trait BuiltInWindowFunctionExpr: Send + Sync + std::fmt::Debug {
             .collect()
     }
 
-    /// Create built-in window evaluator with a batch
+    /// Create a [`PartitionEvaluator`] for evaluating the function on
+    /// a particular partition.
     fn create_evaluator(&self) -> Result<Box<dyn PartitionEvaluator>>;
 
-    /// Construct Reverse Expression that produces the same result
-    /// on a reversed window.  For example `lead(10)` --> `lag(10)`
+    /// Construct a new [`BuiltInWindowFunctionExpr`] that produces
+    /// the same result as this function on a window with reverse
+    /// order. The return value of this function is used by the
+    /// DataFusion optimizer to avoid resorting the data when

Review Comment:
   Done in d56a5b2e5 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6592: Add additional docstrings to Window function implementations

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6592:
URL: https://github.com/apache/arrow-datafusion/pull/6592#discussion_r1222156771


##########
datafusion/physical-expr/src/window/mod.rs:
##########
@@ -32,6 +32,7 @@ mod window_frame_state;
 pub use aggregate::PlainAggregateWindowExpr;
 pub use built_in::BuiltInWindowExpr;
 pub use built_in_window_function_expr::BuiltInWindowFunctionExpr;
+pub use partition_evaluator::PartitionEvaluator;

Review Comment:
   `PartitionEvaluator` now appears in public docstrings (and I propose to make it part of the user defined window function API -- see https://github.com/apache/arrow-datafusion/issues/5781#issuecomment-1581428228)



##########
datafusion/physical-expr/src/window/built_in_window_function_expr.rs:
##########
@@ -69,6 +73,8 @@ pub trait BuiltInWindowFunctionExpr: Send + Sync + std::fmt::Debug {
         false
     }
 
+    /// If returns true, [`Self::create_evaluator`] must implement

Review Comment:
   I was a little unclear on why `use_window_frame` was part of `BuiltInWindowFunctionExpr` and not `PartitionEvaluator` but I haven't looked into it in more detail



##########
datafusion/physical-expr/src/window/sliding_aggregate.rs:
##########
@@ -36,12 +36,10 @@ use crate::{
     expressions::PhysicalSortExpr, reverse_order_bys, AggregateExpr, PhysicalExpr,
 };
 
-/// A window expr that takes the form of an aggregate function
-/// Aggregate Window Expressions that have the form
-/// `OVER({ROWS | RANGE| GROUPS} BETWEEN UNBOUNDED PRECEDING AND ...)`
-/// e.g cumulative window frames uses `PlainAggregateWindowExpr`. Where as Aggregate Window Expressions
-/// that have the form `OVER({ROWS | RANGE| GROUPS} BETWEEN M {PRECEDING| FOLLOWING} AND ...)`
-/// e.g sliding window frames uses `SlidingAggregateWindowExpr`.
+/// A window expr that takes the form of an aggregate function that
+/// can be incrementally computed over sliding windows.
+///
+/// See comments on [`WindowExpr`] for more details.

Review Comment:
   Consolidated this description into `WindowExpr`



##########
datafusion/physical-expr/src/window/partition_evaluator.rs:
##########
@@ -25,24 +25,70 @@ use datafusion_common::{DataFusionError, ScalarValue};
 use std::fmt::Debug;
 use std::ops::Range;
 
-/// Partition evaluator
+/// Partition evaluator for Window Functions
+///
+/// An implementation of this trait is created and used for each
+/// partition defined by the OVER clause.
+///
+/// For example, evaluating `window_func(val) OVER (PARTITION BY col)`
+/// on the following data:
+///
+/// ```text
+/// col | val
+/// --- + ----
+///  A  | 1
+///  A  | 1
+///  C  | 2
+///  D  | 3
+///  D  | 3
+/// ```
+///
+/// Will instantiate three `PartitionEvaluator`s, one each for the
+/// partitions defined by `col=A`, `col=B`, and `col=C`.
+///
+/// There are two types of `PartitionEvaluator`:
+///
+/// # Stateless `PartitionEvaluator`
+///

Review Comment:
   @mustafasrepo  / @ozankabak  if you have time to help me describe more clearly what Stateful and Stateless `PartitionEvaluator`s are , and specifically what is different between them, I would be most appreciative



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6592: Add additional docstrings to Window function implementations

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6592:
URL: https://github.com/apache/arrow-datafusion/pull/6592#discussion_r1223130985


##########
datafusion/physical-expr/src/window/partition_evaluator.rs:
##########
@@ -25,24 +25,70 @@ use datafusion_common::{DataFusionError, ScalarValue};
 use std::fmt::Debug;
 use std::ops::Range;
 
-/// Partition evaluator
+/// Partition evaluator for Window Functions
+///
+/// An implementation of this trait is created and used for each
+/// partition defined by the OVER clause.
+///
+/// For example, evaluating `window_func(val) OVER (PARTITION BY col)`
+/// on the following data:
+///
+/// ```text
+/// col | val
+/// --- + ----
+///  A  | 1
+///  A  | 1
+///  C  | 2
+///  D  | 3
+///  D  | 3
+/// ```
+///
+/// Will instantiate three `PartitionEvaluator`s, one each for the
+/// partitions defined by `col=A`, `col=B`, and `col=C`.
+///
+/// There are two types of `PartitionEvaluator`:
+///
+/// # Stateless `PartitionEvaluator`
+///

Review Comment:
   Thank you @mustafasrepo  -- this is super helpful. I am incorporating this information into this PR



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mustafasrepo commented on a diff in pull request #6592: Add additional docstrings to Window function implementations

Posted by "mustafasrepo (via GitHub)" <gi...@apache.org>.
mustafasrepo commented on code in PR #6592:
URL: https://github.com/apache/arrow-datafusion/pull/6592#discussion_r1222707520


##########
datafusion/physical-expr/src/window/partition_evaluator.rs:
##########
@@ -25,24 +25,70 @@ use datafusion_common::{DataFusionError, ScalarValue};
 use std::fmt::Debug;
 use std::ops::Range;
 
-/// Partition evaluator
+/// Partition evaluator for Window Functions
+///
+/// An implementation of this trait is created and used for each
+/// partition defined by the OVER clause.
+///
+/// For example, evaluating `window_func(val) OVER (PARTITION BY col)`
+/// on the following data:
+///
+/// ```text
+/// col | val
+/// --- + ----
+///  A  | 1
+///  A  | 1
+///  C  | 2
+///  D  | 3
+///  D  | 3
+/// ```
+///
+/// Will instantiate three `PartitionEvaluator`s, one each for the
+/// partitions defined by `col=A`, `col=B`, and `col=C`.
+///
+/// There are two types of `PartitionEvaluator`:
+///
+/// # Stateless `PartitionEvaluator`
+///

Review Comment:
   Some builtin window functions use window frame information inside the window expression (those are `FIRST_VALUE`,  `LAST_VALUE`, `NTH_VALUE`). However, for most of the window functions what is in the window frame is not important (those are `ROW_NUMBER`, `RANK`, `DENSE_RANK`, `PERCENT_RANK`, `CUME_DIST`, `LEAD`, `LAG`). For the ones, using window_frame `PartitionEvaluator::evaluate_inside_range` is called. For the ones that do not use window frame `PartitionEvaluator::evaluate` is called (For rank calculations, `PartitionEvaluator::evaluate_with_rank` is called since its API is quite different. However, it doesn't use window frame either.) 
   
   `PartitionEvaluator::evaluate_stateful` is used only when we produce window result with bounded memory(When window functions are called from the `BoundedWindowAggExec`). In this case window results are calculated in running fashion, hence we need to store previous state, to be able to calculate correct output (For instance, for `ROW_NUMBER` function the current batch evaluator receive may not be the first batch. Hence we cannot start row_number from 0, we need to start from last `ROW_NUMBER` produced for the previous batches received. Similarly, we need to store some information in the state. When we do not receive whole table as a single batch) 
   
   Currently, we have support for bounded(stateful) execution for `FIRST_VALUE`, `LAST_VALUE`, `NTH_VALUE`, `ROW_NUMBER`, `RANK`, `DENSE_RANK`, `LEAD`, `LAG`. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6592: Minor: Add additional docstrings to Window function implementations

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6592:
URL: https://github.com/apache/arrow-datafusion/pull/6592#issuecomment-1583095048

   @mustafasrepo  / @ozankabak  I wonder if you have time to review this documentation and check if I got it right? This PR has no code changes in it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #6592: Minor: Add additional docstrings to Window function implementations

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #6592:
URL: https://github.com/apache/arrow-datafusion/pull/6592


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mustafasrepo commented on a diff in pull request #6592: Minor: Add additional docstrings to Window function implementations

Posted by "mustafasrepo (via GitHub)" <gi...@apache.org>.
mustafasrepo commented on code in PR #6592:
URL: https://github.com/apache/arrow-datafusion/pull/6592#discussion_r1223906161


##########
datafusion/physical-expr/src/window/built_in_window_function_expr.rs:
##########
@@ -56,19 +62,36 @@ pub trait BuiltInWindowFunctionExpr: Send + Sync + std::fmt::Debug {
             .collect()
     }
 
-    /// Create built-in window evaluator with a batch
+    /// Create a [`PartitionEvaluator`] for evaluating the function on
+    /// a particular partition.
     fn create_evaluator(&self) -> Result<Box<dyn PartitionEvaluator>>;
 
-    /// Construct Reverse Expression that produces the same result
-    /// on a reversed window.  For example `lead(10)` --> `lag(10)`
+    /// Construct a new [`BuiltInWindowFunctionExpr`] that produces
+    /// the same result as this function on a window with reverse
+    /// order. The return value of this function is used by the
+    /// DataFusion optimizer to avoid resorting the data when

Review Comment:
   Maybe we can use `re-sorting` instead of `resorting`. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org