You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/06/14 13:10:26 UTC
[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6665: Combine evaluate_stateful and evaluate_inside_range

alamb commented on code in PR #6665:
URL: https://github.com/apache/arrow-datafusion/pull/6665#discussion_r1229589764


##########
datafusion/physical-expr/src/window/built_in_window_function_expr.rs:
##########
@@ -86,27 +86,10 @@ pub trait BuiltInWindowFunctionExpr: Send + Sync + std::fmt::Debug {
     /// The default implementation does nothing
     fn add_equal_orderings(&self, _builder: &mut OrderingEquivalenceBuilder) {}
 
-    /// Can the window function be incrementally computed using
-    /// bounded memory?
-    ///
-    /// If this function returns true, [`Self::create_evaluator`] must
-    /// implement [`PartitionEvaluator::evaluate_stateful`]
-    fn supports_bounded_execution(&self) -> bool {
-        false
-    }
-
-    /// Does the window function use the values from its window frame?
-    ///
-    /// If this function returns true, [`Self::create_evaluator`] must
-    /// implement [`PartitionEvaluator::evaluate_inside_range`]
-    fn uses_window_frame(&self) -> bool {
-        false
-    }
-
     /// Can this function be evaluated with (only) rank
     ///
     /// If `include_rank` is true, then [`Self::create_evaluator`] must
-    /// implement [`PartitionEvaluator::evaluate_with_rank`]
+    /// implement [`PartitionEvaluator::evaluate_with_rank_all`]

Review Comment:
   Do you plan to give the same treatment (move to `PartitionEvaluator`) to `include_rank`?



##########
datafusion/physical-expr/src/window/partition_evaluator.rs:
##########
@@ -132,23 +140,40 @@ pub trait PartitionEvaluator: Debug + Send {
 
     /// Called for window functions that *do not use* values from the
     /// the window frame, such as `ROW_NUMBER`, `RANK`, `DENSE_RANK`,
-    /// `PERCENT_RANK`, `CUME_DIST`, `LEAD`, `LAG`).
-    fn evaluate(&self, _values: &[ArrayRef], _num_rows: usize) -> Result<ArrayRef> {
-        Err(DataFusionError::NotImplemented(
-            "evaluate is not implemented by default".into(),
-        ))
+    /// `PERCENT_RANK`, `CUME_DIST`, `LEAD`, `LAG`). It produces result
+    /// of all rows in a single pass. It expects to receive whole table data
+    /// as a single batch.
+    fn evaluate_all(&mut self, values: &[ArrayRef], num_rows: usize) -> Result<ArrayRef> {
+        // When window frame boundaries are not used and evaluator supports bounded execution
+        // We can calculate evaluate result by repeatedly calling `self.evaluate` `num_rows` times
+        // If user wants to implement more efficient version, this method should be overwritten
+        // Default implementation may behave suboptimally (For instance `NumRowEvaluator` overwrites it)
+        if !self.uses_window_frame() && self.supports_bounded_execution() {
+            let res = (0..num_rows)
+                .map(|_idx| self.evaluate(values, &Range { start: 0, end: 1 }))
+                .collect::<Result<Vec<_>>>()?;
+            ScalarValue::iter_to_array(res.into_iter())
+        } else {
+            Err(DataFusionError::NotImplemented(
+                "evaluate_all is not implemented by default".into(),

Review Comment:
   ```suggestion
                   format!("evaluate_all is not implemented for {} when using window frames", self.name()),
   ```
   
   I think the error could be more helpful, but I can also add that as a follow on PR
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org