You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "berkaysynnada (via GitHub)" <gi...@apache.org> on 2023/05/23 07:24:36 UTC

[GitHub] [arrow-datafusion] berkaysynnada opened a new pull request, #6419: Named window support

berkaysynnada opened a new pull request, #6419:
URL: https://github.com/apache/arrow-datafusion/pull/6419

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #6125.
   
   # Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   With 0.34 release of sql-parser, statements can differentiate the named windows. We can now support such queries:
   ```
   SELECT SUM(amount) OVER running_window, AVG(amount) OVER running_window
                FROM sales_us
                WINDOW running_window AS (ORDER BY ts)
                ORDER BY ts
                LIMIT 5;
   ```
   
   
   # What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   After the query parse, WindowType of the function can be in the form of WindowSpec or NamedWindow. In `select_to_plan()` function, if there exist some NamedWindow's, there are reshaped as WindowSpec's by using the definition of the windows. If there are any undefined windows, an error is generated.
   
   sqlparser dependency is also updated to 0.34 with this PR.
   
   # Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   Yes, slt tests are added for both working and erroneous conditions.
   
   # Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mustafasrepo commented on a diff in pull request #6419: Named window support

Posted by "mustafasrepo (via GitHub)" <gi...@apache.org>.
mustafasrepo commented on code in PR #6419:
URL: https://github.com/apache/arrow-datafusion/pull/6419#discussion_r1201689289


##########
datafusion/sql/src/expr/mod.rs:
##########
@@ -353,11 +353,7 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
         let order_by = if let Some(order_by) = order_by {
             // TODO: Once sqlparser supports multiple order by clause, handle it
             //       see issue: https://github.com/sqlparser-rs/sqlparser-rs/issues/875
-            Some(vec![self.order_by_to_sort_expr(
-                *order_by,
-                input_schema,
-                planner_context,
-            )?])
+            Some(self.order_by_to_sort_expr(&order_by, input_schema, planner_context)?)

Review Comment:
   I think you should remove this TODO!.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6419: Named window support

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6419:
URL: https://github.com/apache/arrow-datafusion/pull/6419#discussion_r1204127393


##########
datafusion/core/tests/sqllogictests/test_files/window.slt:
##########
@@ -3023,3 +3023,64 @@ drop table annotated_data_finite2
 
 statement ok
 drop table annotated_data_infinite2
+
+# window3 spec is not used in window functions.
+# The query should still work.
+query RR
+SELECT
+  MAX(c12) OVER window1 as min1,
+  MIN(c12) OVER window2 as max1
+  FROM aggregate_test_100
+  WINDOW window1 AS (ORDER BY C12),
+  window2 AS (PARTITION BY C11),
+  window3 AS (ORDER BY C1)
+  ORDER BY C3
+  LIMIT 5
+----
+0.970671228336 0.970671228336
+0.850672105305 0.850672105305
+0.152498292972 0.152498292972
+0.369363046006 0.369363046006
+0.56535284223 0.56535284223
+
+query TT
+EXPLAIN SELECT
+  MAX(c12) OVER window1 as min1,
+  MIN(c12) OVER window2 as max1
+  FROM aggregate_test_100
+  WINDOW window1 AS (ORDER BY C12),
+  window2 AS (PARTITION BY C11),
+  window3 AS (ORDER BY C1)
+  ORDER BY C3
+  LIMIT 5
+----
+logical_plan
+Projection: min1, max1
+--Limit: skip=0, fetch=5
+----Sort: aggregate_test_100.c3 ASC NULLS LAST, fetch=5
+------Projection: MAX(aggregate_test_100.c12) ORDER BY [aggregate_test_100.c12 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW AS min1, MIN(aggregate_test_100.c12) PARTITION BY [aggregate_test_100.c11] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS max1, aggregate_test_100.c3
+--------WindowAggr: windowExpr=[[MAX(aggregate_test_100.c12) ORDER BY [aggregate_test_100.c12 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW]]
+----------Projection: aggregate_test_100.c3, aggregate_test_100.c12, MIN(aggregate_test_100.c12) PARTITION BY [aggregate_test_100.c11] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
+------------WindowAggr: windowExpr=[[MIN(aggregate_test_100.c12) PARTITION BY [aggregate_test_100.c11] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]]
+--------------TableScan: aggregate_test_100 projection=[c3, c11, c12]
+physical_plan
+ProjectionExec: expr=[min1@0 as min1, max1@1 as max1]
+--GlobalLimitExec: skip=0, fetch=5
+----SortExec: fetch=5, expr=[c3@2 ASC NULLS LAST]
+------ProjectionExec: expr=[MAX(aggregate_test_100.c12) ORDER BY [aggregate_test_100.c12 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW@3 as min1, MIN(aggregate_test_100.c12) PARTITION BY [aggregate_test_100.c11] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING@2 as max1, c3@0 as c3]
+--------BoundedWindowAggExec: wdw=[MAX(aggregate_test_100.c12): Ok(Field { name: "MAX(aggregate_test_100.c12)", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Range, start_bound: Preceding(Float64(NULL)), end_bound: CurrentRow }], mode=[Sorted]
+----------SortExec: expr=[c12@1 ASC NULLS LAST]
+------------ProjectionExec: expr=[c3@0 as c3, c12@2 as c12, MIN(aggregate_test_100.c12) PARTITION BY [aggregate_test_100.c11] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING@3 as MIN(aggregate_test_100.c12)]
+--------------WindowAggExec: wdw=[MIN(aggregate_test_100.c12): Ok(Field { name: "MIN(aggregate_test_100.c12)", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(NULL)), end_bound: Following(UInt64(NULL)) }]
+----------------SortExec: expr=[c11@1 ASC NULLS LAST]
+------------------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c3, c11, c12], has_header=true
+
+# window2 spec is not defined

Review Comment:
   👍 



##########
datafusion/sql/src/select.rs:
##########
@@ -70,10 +71,37 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
         // process `where` clause
         let plan = self.plan_selection(select.selection, plan, planner_context)?;
 
+        // handle named windows before processing the projection expression

Review Comment:
   I think the code would be nicer if this code to expand out named window definitions was refactored into a sub function -- something like `fn expand_window_definitions()` or something, perhaps



##########
datafusion/sql/src/select.rs:
##########
@@ -70,10 +71,37 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
         // process `where` clause
         let plan = self.plan_selection(select.selection, plan, planner_context)?;
 
+        // handle named windows before processing the projection expression
+        let mut modified_projection = select.projection.clone();
+        // If the projection is done over a named window, that window
+        // name must be defined. Otherwise, it gives an error.
+        for proj in modified_projection.iter_mut() {
+            if let SelectItem::ExprWithAlias {
+                expr: SQLExpr::Function(f),
+                alias: _,
+            } = proj
+            {
+                for NamedWindowDefinition(window_ident, window_spec) in
+                    select.named_window.iter()
+                {
+                    if let Some(WindowType::NamedWindow(ident)) = &f.over {
+                        if ident.eq(window_ident) {
+                            f.over = Some(WindowType::WindowSpec(window_spec.clone()))
+                        }
+                    }
+                }
+                // All named windows must be defined with a WindowSpec.

Review Comment:
   Is it worth checking here for duplicated window names (aka two definitions of `window1`?)



##########
datafusion/core/tests/sqllogictests/test_files/window.slt:
##########
@@ -3023,3 +3023,64 @@ drop table annotated_data_finite2
 
 statement ok
 drop table annotated_data_infinite2
+
+# window3 spec is not used in window functions.

Review Comment:
   Another test that might be valuable is to to use the same named window more than once as I think this is one reason named windows are often used.
   
   Something like this perhaps, which uses `window1` more than once:
   
   ```sql
   SELECT
     MAX(c12) OVER window1 as min1,
     MIN(c12) OVER window1 as max1
     FROM aggregate_test_100
     WINDOW window1 AS (ORDER BY C12),
     ORDER BY C3
     LIMIT 5
   
   ```



##########
datafusion/sql/src/select.rs:
##########
@@ -70,10 +71,37 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
         // process `where` clause
         let plan = self.plan_selection(select.selection, plan, planner_context)?;
 
+        // handle named windows before processing the projection expression
+        let mut modified_projection = select.projection.clone();

Review Comment:
   You may be able to avoid a clone here by modifing `select.projection` in place



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6419: Named window support

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6419:
URL: https://github.com/apache/arrow-datafusion/pull/6419#discussion_r1205834575


##########
datafusion/sql/src/select.rs:
##########
@@ -520,3 +525,49 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
         Ok((plan, select_exprs_post_aggr, having_expr_post_aggr))
     }
 }
+
+// If there are any multiple-defined windows, we raise an error.
+fn check_conflicting_windows(window_defs: &[NamedWindowDefinition]) -> Result<()> {

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mustafasrepo commented on a diff in pull request #6419: Named window support

Posted by "mustafasrepo (via GitHub)" <gi...@apache.org>.
mustafasrepo commented on code in PR #6419:
URL: https://github.com/apache/arrow-datafusion/pull/6419#discussion_r1201689289


##########
datafusion/sql/src/expr/mod.rs:
##########
@@ -353,11 +353,7 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
         let order_by = if let Some(order_by) = order_by {
             // TODO: Once sqlparser supports multiple order by clause, handle it
             //       see issue: https://github.com/sqlparser-rs/sqlparser-rs/issues/875
-            Some(vec![self.order_by_to_sort_expr(
-                *order_by,
-                input_schema,
-                planner_context,
-            )?])
+            Some(self.order_by_to_sort_expr(&order_by, input_schema, planner_context)?)

Review Comment:
   I think you should remove this TODO!. 
   I think it would be great to add an example to `groupby.slt` file, where an aggregate function with lexicographical ordering requirement is used.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6419: Named window support

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6419:
URL: https://github.com/apache/arrow-datafusion/pull/6419#issuecomment-1559475057

   https://github.com/apache/arrow-datafusion/pull/6416 is merged


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] berkaysynnada commented on a diff in pull request #6419: Named window support

Posted by "berkaysynnada (via GitHub)" <gi...@apache.org>.
berkaysynnada commented on code in PR #6419:
URL: https://github.com/apache/arrow-datafusion/pull/6419#discussion_r1205060247


##########
datafusion/sql/src/select.rs:
##########
@@ -70,10 +71,37 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
         // process `where` clause
         let plan = self.plan_selection(select.selection, plan, planner_context)?;
 
+        // handle named windows before processing the projection expression
+        let mut modified_projection = select.projection.clone();
+        // If the projection is done over a named window, that window
+        // name must be defined. Otherwise, it gives an error.
+        for proj in modified_projection.iter_mut() {
+            if let SelectItem::ExprWithAlias {
+                expr: SQLExpr::Function(f),
+                alias: _,
+            } = proj
+            {
+                for NamedWindowDefinition(window_ident, window_spec) in
+                    select.named_window.iter()
+                {
+                    if let Some(WindowType::NamedWindow(ident)) = &f.over {
+                        if ident.eq(window_ident) {
+                            f.over = Some(WindowType::WindowSpec(window_spec.clone()))
+                        }
+                    }
+                }
+                // All named windows must be defined with a WindowSpec.

Review Comment:
   I have checked, it says "Query Error: error: window "window3" is already defined"



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6419: Named window support

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6419:
URL: https://github.com/apache/arrow-datafusion/pull/6419#issuecomment-1560171070

   I plan to review this tomorrow


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] berkaysynnada commented on pull request #6419: Named window support

Posted by "berkaysynnada (via GitHub)" <gi...@apache.org>.
berkaysynnada commented on PR #6419:
URL: https://github.com/apache/arrow-datafusion/pull/6419#issuecomment-1558751462

   @alamb I have taken the necessary changes from your PR https://github.com/apache/arrow-datafusion/pull/6416 regarding the sqlparser update. I wonder what you suggest, should I remove those parts and wait for your PR to be merged, or you prefer to continue with this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] berkaysynnada commented on a diff in pull request #6419: Named window support

Posted by "berkaysynnada (via GitHub)" <gi...@apache.org>.
berkaysynnada commented on code in PR #6419:
URL: https://github.com/apache/arrow-datafusion/pull/6419#discussion_r1204587073


##########
datafusion/sql/src/select.rs:
##########
@@ -70,10 +71,37 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
         // process `where` clause
         let plan = self.plan_selection(select.selection, plan, planner_context)?;
 
+        // handle named windows before processing the projection expression
+        let mut modified_projection = select.projection.clone();
+        // If the projection is done over a named window, that window
+        // name must be defined. Otherwise, it gives an error.
+        for proj in modified_projection.iter_mut() {
+            if let SelectItem::ExprWithAlias {
+                expr: SQLExpr::Function(f),
+                alias: _,
+            } = proj
+            {
+                for NamedWindowDefinition(window_ident, window_spec) in
+                    select.named_window.iter()
+                {
+                    if let Some(WindowType::NamedWindow(ident)) = &f.over {
+                        if ident.eq(window_ident) {
+                            f.over = Some(WindowType::WindowSpec(window_spec.clone()))
+                        }
+                    }
+                }
+                // All named windows must be defined with a WindowSpec.

Review Comment:
   Thanks for the review, I agree with you. It would be better to add such a control. I think even if that multiple-defined window is not used, we should give an error, am I right?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6419: Named window support

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6419:
URL: https://github.com/apache/arrow-datafusion/pull/6419#discussion_r1204779057


##########
datafusion/sql/src/select.rs:
##########
@@ -70,10 +71,37 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
         // process `where` clause
         let plan = self.plan_selection(select.selection, plan, planner_context)?;
 
+        // handle named windows before processing the projection expression
+        let mut modified_projection = select.projection.clone();
+        // If the projection is done over a named window, that window
+        // name must be defined. Otherwise, it gives an error.
+        for proj in modified_projection.iter_mut() {
+            if let SelectItem::ExprWithAlias {
+                expr: SQLExpr::Function(f),
+                alias: _,
+            } = proj
+            {
+                for NamedWindowDefinition(window_ident, window_spec) in
+                    select.named_window.iter()
+                {
+                    if let Some(WindowType::NamedWindow(ident)) = &f.over {
+                        if ident.eq(window_ident) {
+                            f.over = Some(WindowType::WindowSpec(window_spec.clone()))
+                        }
+                    }
+                }
+                // All named windows must be defined with a WindowSpec.

Review Comment:
   
   I recommend we do whatever postgres does (which I assume is throw an error, but I haven't checked)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6419: Named window support

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6419:
URL: https://github.com/apache/arrow-datafusion/pull/6419#issuecomment-1559095276

   > This PR needs sql-parser 0.34 release also. @alamb I have taken the necessary changes from https://github.com/apache/arrow-datafusion/pull/6416 regarding the sqlparser update. I wonder what you suggest, should I remove those parts and wait for your PR to be merged, or you prefer to continue with this PR?
   
   I recommend we wait for https://github.com/apache/arrow-datafusion/pull/6416  to merge, but we can review this PR earlier.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mustafasrepo merged pull request #6419: Named window support

Posted by "mustafasrepo (via GitHub)" <gi...@apache.org>.
mustafasrepo merged PR #6419:
URL: https://github.com/apache/arrow-datafusion/pull/6419


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org