You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "mustafasrepo (via GitHub)" <gi...@apache.org> on 2023/06/08 09:24:18 UTC

[GitHub] [arrow-datafusion] mustafasrepo commented on a diff in pull request #6592: Add additional docstrings to Window function implementations

mustafasrepo commented on code in PR #6592:
URL: https://github.com/apache/arrow-datafusion/pull/6592#discussion_r1222707520


##########
datafusion/physical-expr/src/window/partition_evaluator.rs:
##########
@@ -25,24 +25,70 @@ use datafusion_common::{DataFusionError, ScalarValue};
 use std::fmt::Debug;
 use std::ops::Range;
 
-/// Partition evaluator
+/// Partition evaluator for Window Functions
+///
+/// An implementation of this trait is created and used for each
+/// partition defined by the OVER clause.
+///
+/// For example, evaluating `window_func(val) OVER (PARTITION BY col)`
+/// on the following data:
+///
+/// ```text
+/// col | val
+/// --- + ----
+///  A  | 1
+///  A  | 1
+///  C  | 2
+///  D  | 3
+///  D  | 3
+/// ```
+///
+/// Will instantiate three `PartitionEvaluator`s, one each for the
+/// partitions defined by `col=A`, `col=B`, and `col=C`.
+///
+/// There are two types of `PartitionEvaluator`:
+///
+/// # Stateless `PartitionEvaluator`
+///

Review Comment:
   Some builtin window functions use window frame information inside the window expression (those are `FIRST_VALUE`,  `LAST_VALUE`, `NTH_VALUE`). However, for most of the window functions what is in the window frame is not important (those are `ROW_NUMBER`, `RANK`, `DENSE_RANK`, `PERCENT_RANK`, `CUME_DIST`, `LEAD`, `LAG`). For the ones, using window_frame `PartitionEvaluator::evaluate_inside_range` is called. For the ones that do not use window frame `PartitionEvaluator::evaluate` is called (For rank calculations, `PartitionEvaluator::evaluate_with_rank` is called since its API is quite different. However, it doesn't use window frame either.) 
   
   `PartitionEvaluator::evaluate_stateful` is used only when we produce window result with bounded memory(When window functions are called from the `BoundedWindowAggExec`). In this case window results are calculated in running fashion, hence we need to store previous state, to be able to calculate correct output (For instance, for `ROW_NUMBER` function the current batch evaluator receive may not be the first batch. Hence we cannot start row_number from 0, we need to start from last `ROW_NUMBER` produced for the previous batches received. Similarly, we need to store some information in the state. When we do not receive whole table as a single batch) 
   
   Currently, we have support for bounded(stateful) execution for `FIRST_VALUE`, `LAST_VALUE`, `NTH_VALUE`, `ROW_NUMBER`, `RANK`, `DENSE_RANK`, `LEAD`, `LAG`. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org