You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/26 14:18:22 UTC

[GitHub] [arrow-datafusion] Jimexist opened a new pull request #429: implement lead and lag built-in window function

Jimexist opened a new pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429


   # Which issue does this PR close?
   
   implement lead and lag built-in window function. based on #403 so review that first
   
   Closes #.
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Jimexist commented on pull request #429: implement lead and lag built-in window function

Posted by GitBox <gi...@apache.org>.
Jimexist commented on pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429#issuecomment-860225216


   > Actually let's park this pull request for a while - I plan to implement sort and partition first and then window frame, after which the window shift approach might not be relevant.
   
   now that #520 is implemented, this PR is ready


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Jimexist commented on a change in pull request #429: implement lead and lag built-in window function

Posted by GitBox <gi...@apache.org>.
Jimexist commented on a change in pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429#discussion_r642179178



##########
File path: datafusion/src/physical_plan/windows.rs
##########
@@ -421,18 +435,52 @@ async fn compute_window_aggregate(
     let aggregated_mapped = finalize_window_aggregation(&window_accumulators)
         .map_err(DataFusionError::into_arrow_external_error)?;
 
+    let window_shifts = window_accumulators
+        .iter()
+        .map(|acc| acc.window_shift())
+        .collect::<Vec<_>>();
+
     let mut columns: Vec<ArrayRef> = accumulator
         .iter()
         .zip(aggregated_mapped)
-        .map(|(acc, agg)| {
-            Ok(match (acc, agg) {
-                (acc, Some(scalar_value)) if acc.is_empty() => {
+        .zip(window_shifts)
+        .map(|((acc, agg), window_shift)| {
+            Ok(match (acc, agg, window_shift) {
+                (acc, Some(scalar_value), None) if acc.is_empty() => {
                     scalar_value.to_array_of_size(total_num_rows)
                 }
-                (acc, None) if !acc.is_empty() => {
+                (acc, None, window_shift) if !acc.is_empty() => {
                     let vec_array: Vec<&dyn Array> =
                         acc.iter().map(|arc| arc.as_ref()).collect();
-                    concat(&vec_array)?
+                    let arr: ArrayRef = concat(&vec_array)?;
+                    if arr.is_empty() {
+                        arr
+                    } else if arr.len() != total_num_rows {
+                        return Err(DataFusionError::Internal(format!(
+                            "Invalid concatenated array of length {}, expecting {}",
+                            arr.len(),
+                            total_num_rows
+                        )));
+                    } else {
+                        let data_type = arr.data_type();
+                        match window_shift {

Review comment:
       @Dandandan turns out it's not that easy. it's only for primitive array.
   
   also i'd need to fix some boundary cases as well https://github.com/apache/arrow-rs/pull/386




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Jimexist commented on pull request #429: implement lead and lag built-in window function

Posted by GitBox <gi...@apache.org>.
Jimexist commented on pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429#issuecomment-861103557


   putting this back to draft as this relies on https://github.com/apache/arrow-rs/pull/388 which is not yet in arrow 4.3


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Dandandan commented on a change in pull request #429: implement lead and lag built-in window function

Posted by GitBox <gi...@apache.org>.
Dandandan commented on a change in pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429#discussion_r639800391



##########
File path: datafusion/src/physical_plan/windows.rs
##########
@@ -421,18 +435,52 @@ async fn compute_window_aggregate(
     let aggregated_mapped = finalize_window_aggregation(&window_accumulators)
         .map_err(DataFusionError::into_arrow_external_error)?;
 
+    let window_shifts = window_accumulators
+        .iter()
+        .map(|acc| acc.window_shift())
+        .collect::<Vec<_>>();
+
     let mut columns: Vec<ArrayRef> = accumulator
         .iter()
         .zip(aggregated_mapped)
-        .map(|(acc, agg)| {
-            Ok(match (acc, agg) {
-                (acc, Some(scalar_value)) if acc.is_empty() => {
+        .zip(window_shifts)
+        .map(|((acc, agg), window_shift)| {
+            Ok(match (acc, agg, window_shift) {
+                (acc, Some(scalar_value), None) if acc.is_empty() => {
                     scalar_value.to_array_of_size(total_num_rows)
                 }
-                (acc, None) if !acc.is_empty() => {
+                (acc, None, window_shift) if !acc.is_empty() => {
                     let vec_array: Vec<&dyn Array> =
                         acc.iter().map(|arc| arc.as_ref()).collect();
-                    concat(&vec_array)?
+                    let arr: ArrayRef = concat(&vec_array)?;
+                    if arr.is_empty() {
+                        arr
+                    } else if arr.len() != total_num_rows {
+                        return Err(DataFusionError::Internal(format!(
+                            "Invalid concatenated array of length {}, expecting {}",
+                            arr.len(),
+                            total_num_rows
+                        )));
+                    } else {
+                        let data_type = arr.data_type();
+                        match window_shift {

Review comment:
       What about using: https://docs.rs/arrow/4.1.0/arrow/compute/kernels/window/fn.shift.html




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Jimexist commented on pull request #429: implement lead and lag built-in window function

Posted by GitBox <gi...@apache.org>.
Jimexist commented on pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429#issuecomment-853192661


   Actually let's park this pull request for a while - I plan to implement sort and partition first and then window frame, after which the window shift approach might not be relevant.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Jimexist commented on a change in pull request #429: implement lead and lag built-in window function

Posted by GitBox <gi...@apache.org>.
Jimexist commented on a change in pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429#discussion_r642219305



##########
File path: datafusion/src/physical_plan/windows.rs
##########
@@ -421,18 +435,52 @@ async fn compute_window_aggregate(
     let aggregated_mapped = finalize_window_aggregation(&window_accumulators)
         .map_err(DataFusionError::into_arrow_external_error)?;
 
+    let window_shifts = window_accumulators
+        .iter()
+        .map(|acc| acc.window_shift())
+        .collect::<Vec<_>>();
+
     let mut columns: Vec<ArrayRef> = accumulator
         .iter()
         .zip(aggregated_mapped)
-        .map(|(acc, agg)| {
-            Ok(match (acc, agg) {
-                (acc, Some(scalar_value)) if acc.is_empty() => {
+        .zip(window_shifts)
+        .map(|((acc, agg), window_shift)| {
+            Ok(match (acc, agg, window_shift) {
+                (acc, Some(scalar_value), None) if acc.is_empty() => {
                     scalar_value.to_array_of_size(total_num_rows)
                 }
-                (acc, None) if !acc.is_empty() => {
+                (acc, None, window_shift) if !acc.is_empty() => {
                     let vec_array: Vec<&dyn Array> =
                         acc.iter().map(|arc| arc.as_ref()).collect();
-                    concat(&vec_array)?
+                    let arr: ArrayRef = concat(&vec_array)?;
+                    if arr.is_empty() {
+                        arr
+                    } else if arr.len() != total_num_rows {
+                        return Err(DataFusionError::Internal(format!(
+                            "Invalid concatenated array of length {}, expecting {}",
+                            arr.len(),
+                            total_num_rows
+                        )));
+                    } else {
+                        let data_type = arr.data_type();
+                        match window_shift {

Review comment:
       i can wait for https://github.com/apache/arrow-rs/pull/388 to be merged or maybe merge this first and refactor into that later when arrow new release is available




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #429: implement lead and lag built-in window function

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429#issuecomment-872564179


   Thanks @Jimexist  -- I ran out of time today but will check this out tomorrow


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb merged pull request #429: implement lead and lag built-in window function

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #429: implement lead and lag built-in window function

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429#discussion_r660884744



##########
File path: datafusion/src/physical_plan/expressions/lead_lag.rs
##########
@@ -0,0 +1,180 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Defines physical expression for `lead` and `lag` that can evaluated
+//! at runtime during query execution
+
+use crate::error::{DataFusionError, Result};
+use crate::physical_plan::window_functions::PartitionEvaluator;
+use crate::physical_plan::{window_functions::BuiltInWindowFunctionExpr, PhysicalExpr};
+use arrow::array::ArrayRef;
+use arrow::compute::kernels::window::shift;
+use arrow::datatypes::{DataType, Field};
+use arrow::record_batch::RecordBatch;
+use std::any::Any;
+use std::ops::Range;
+use std::sync::Arc;
+
+/// window shift expression
+#[derive(Debug)]
+pub struct WindowShift {
+    name: String,
+    data_type: DataType,
+    shift_offset: i64,
+    expr: Arc<dyn PhysicalExpr>,
+}
+
+/// lead() window function
+pub fn lead(
+    name: String,
+    data_type: DataType,
+    expr: Arc<dyn PhysicalExpr>,
+) -> WindowShift {
+    WindowShift {
+        name,
+        data_type,
+        shift_offset: -1,
+        expr,
+    }
+}
+
+/// lag() window function
+pub fn lag(
+    name: String,
+    data_type: DataType,
+    expr: Arc<dyn PhysicalExpr>,
+) -> WindowShift {
+    WindowShift {
+        name,
+        data_type,
+        shift_offset: 1,
+        expr,
+    }
+}
+
+impl BuiltInWindowFunctionExpr for WindowShift {
+    /// Return a reference to Any that can be used for downcasting
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn field(&self) -> Result<Field> {
+        let nullable = true;
+        Ok(Field::new(&self.name, self.data_type.clone(), nullable))
+    }
+
+    fn expressions(&self) -> Vec<Arc<dyn PhysicalExpr>> {
+        vec![self.expr.clone()]
+    }
+
+    fn name(&self) -> &str {
+        &self.name
+    }
+
+    fn create_evaluator(
+        &self,
+        batch: &RecordBatch,
+    ) -> Result<Box<dyn PartitionEvaluator>> {
+        let values = self
+            .expressions()
+            .iter()
+            .map(|e| e.evaluate(batch))
+            .map(|r| r.map(|v| v.into_array(batch.num_rows())))
+            .collect::<Result<Vec<_>>>()?;
+        Ok(Box::new(WindowShiftEvaluator {
+            shift_offset: self.shift_offset,
+            values,
+        }))
+    }
+}
+
+pub(crate) struct WindowShiftEvaluator {
+    shift_offset: i64,
+    values: Vec<ArrayRef>,
+}
+
+impl PartitionEvaluator for WindowShiftEvaluator {
+    fn evaluate_partition(&self, _partition: Range<usize>) -> Result<ArrayRef> {
+        let value = &self.values[0];
+        shift(value.as_ref(), self.shift_offset).map_err(DataFusionError::ArrowError)

Review comment:
       do you need to restrict the window to the partition bounds? If the input array had 10 rows in 2 partitions, wouldn't this code produce 2 output partitions of 10 rows each (rather than 2 output partitions of 5 rows each)?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #429: implement lead and lag built-in window function

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429#discussion_r643391389



##########
File path: datafusion/src/physical_plan/windows.rs
##########
@@ -421,18 +435,52 @@ async fn compute_window_aggregate(
     let aggregated_mapped = finalize_window_aggregation(&window_accumulators)
         .map_err(DataFusionError::into_arrow_external_error)?;
 
+    let window_shifts = window_accumulators
+        .iter()
+        .map(|acc| acc.window_shift())
+        .collect::<Vec<_>>();
+
     let mut columns: Vec<ArrayRef> = accumulator
         .iter()
         .zip(aggregated_mapped)
-        .map(|(acc, agg)| {
-            Ok(match (acc, agg) {
-                (acc, Some(scalar_value)) if acc.is_empty() => {
+        .zip(window_shifts)
+        .map(|((acc, agg), window_shift)| {
+            Ok(match (acc, agg, window_shift) {
+                (acc, Some(scalar_value), None) if acc.is_empty() => {
                     scalar_value.to_array_of_size(total_num_rows)
                 }
-                (acc, None) if !acc.is_empty() => {
+                (acc, None, window_shift) if !acc.is_empty() => {
                     let vec_array: Vec<&dyn Array> =
                         acc.iter().map(|arc| arc.as_ref()).collect();
-                    concat(&vec_array)?
+                    let arr: ArrayRef = concat(&vec_array)?;
+                    if arr.is_empty() {
+                        arr
+                    } else if arr.len() != total_num_rows {
+                        return Err(DataFusionError::Internal(format!(
+                            "Invalid concatenated array of length {}, expecting {}",
+                            arr.len(),
+                            total_num_rows
+                        )));
+                    } else {
+                        let data_type = arr.data_type();
+                        match window_shift {

Review comment:
       Thanks @Jimexist  -- I am working through the PRs now. I'll get to this one shortly




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Jimexist commented on pull request #429: implement lead and lag built-in window function

Posted by GitBox <gi...@apache.org>.
Jimexist commented on pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429#issuecomment-870594468


   @alamb and @Dandandan this pull request is ready now


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #429: implement lead and lag built-in window function

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429#issuecomment-861383836


   > putting this back to draft as this relies on apache/arrow-rs#388 which is not yet in arrow 4.3
   
   Oh no!
   
   Can we possibly use the API that is in Arrow 4.3 (and then we can upgrade datafusion to use the new api when the next version of Arrow comes out)?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Jimexist commented on pull request #429: implement lead and lag built-in window function

Posted by GitBox <gi...@apache.org>.
Jimexist commented on pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429#issuecomment-861419260


   > > putting this back to draft as this relies on apache/arrow-rs#388 which is not yet in arrow 4.3
   > 
   > 
   > 
   > Oh no!
   > 
   > 
   > 
   > Can we possibly use the API that is in Arrow 4.3 (and then we can upgrade datafusion to use the new api when the next version of Arrow comes out)?
   
   I don't mind parking this one here for a while since there would be many other window frame stuff to be done before revisiting this and by that time newer version would be released


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #429: implement lead and lag built-in window function

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429#issuecomment-852533565


   I plan to review this PR tomorrow


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Jimexist commented on a change in pull request #429: implement lead and lag built-in window function

Posted by GitBox <gi...@apache.org>.
Jimexist commented on a change in pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429#discussion_r661893611



##########
File path: datafusion/src/physical_plan/expressions/lead_lag.rs
##########
@@ -0,0 +1,180 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! Defines physical expression for `lead` and `lag` that can evaluated
+//! at runtime during query execution
+
+use crate::error::{DataFusionError, Result};
+use crate::physical_plan::window_functions::PartitionEvaluator;
+use crate::physical_plan::{window_functions::BuiltInWindowFunctionExpr, PhysicalExpr};
+use arrow::array::ArrayRef;
+use arrow::compute::kernels::window::shift;
+use arrow::datatypes::{DataType, Field};
+use arrow::record_batch::RecordBatch;
+use std::any::Any;
+use std::ops::Range;
+use std::sync::Arc;
+
+/// window shift expression
+#[derive(Debug)]
+pub struct WindowShift {
+    name: String,
+    data_type: DataType,
+    shift_offset: i64,
+    expr: Arc<dyn PhysicalExpr>,
+}
+
+/// lead() window function
+pub fn lead(
+    name: String,
+    data_type: DataType,
+    expr: Arc<dyn PhysicalExpr>,
+) -> WindowShift {
+    WindowShift {
+        name,
+        data_type,
+        shift_offset: -1,
+        expr,
+    }
+}
+
+/// lag() window function
+pub fn lag(
+    name: String,
+    data_type: DataType,
+    expr: Arc<dyn PhysicalExpr>,
+) -> WindowShift {
+    WindowShift {
+        name,
+        data_type,
+        shift_offset: 1,
+        expr,
+    }
+}
+
+impl BuiltInWindowFunctionExpr for WindowShift {
+    /// Return a reference to Any that can be used for downcasting
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn field(&self) -> Result<Field> {
+        let nullable = true;
+        Ok(Field::new(&self.name, self.data_type.clone(), nullable))
+    }
+
+    fn expressions(&self) -> Vec<Arc<dyn PhysicalExpr>> {
+        vec![self.expr.clone()]
+    }
+
+    fn name(&self) -> &str {
+        &self.name
+    }
+
+    fn create_evaluator(
+        &self,
+        batch: &RecordBatch,
+    ) -> Result<Box<dyn PartitionEvaluator>> {
+        let values = self
+            .expressions()
+            .iter()
+            .map(|e| e.evaluate(batch))
+            .map(|r| r.map(|v| v.into_array(batch.num_rows())))
+            .collect::<Result<Vec<_>>>()?;
+        Ok(Box::new(WindowShiftEvaluator {
+            shift_offset: self.shift_offset,
+            values,
+        }))
+    }
+}
+
+pub(crate) struct WindowShiftEvaluator {
+    shift_offset: i64,
+    values: Vec<ArrayRef>,
+}
+
+impl PartitionEvaluator for WindowShiftEvaluator {
+    fn evaluate_partition(&self, _partition: Range<usize>) -> Result<ArrayRef> {
+        let value = &self.values[0];
+        shift(value.as_ref(), self.shift_offset).map_err(DataFusionError::ArrowError)

Review comment:
       @alamb good catch, this is fixed and add with integration tests.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter edited a comment on pull request #429: implement lead and lag built-in window function

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429#issuecomment-849123108


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/429?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#429](https://codecov.io/gh/apache/arrow-datafusion/pull/429?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (f763205) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/321fda40a47bcc494c5d2511b6e8b02c9ea975b4?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (321fda4) will **decrease** coverage by `0.01%`.
   > The diff coverage is `65.47%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/429/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/429?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #429      +/-   ##
   ==========================================
   - Coverage   75.16%   75.15%   -0.02%     
   ==========================================
     Files         150      152       +2     
     Lines       25144    25357     +213     
   ==========================================
   + Hits        18899    19056     +157     
   - Misses       6245     6301      +56     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/429?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [ballista/rust/client/src/columnar\_batch.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/429/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9jbGllbnQvc3JjL2NvbHVtbmFyX2JhdGNoLnJz) | `0.00% <ø> (ø)` | |
   | [ballista/rust/core/src/serde/scheduler/mod.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/429/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9jb3JlL3NyYy9zZXJkZS9zY2hlZHVsZXIvbW9kLnJz) | `14.28% <ø> (ø)` | |
   | [datafusion/src/physical\_plan/expressions/mod.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/429/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9leHByZXNzaW9ucy9tb2QucnM=) | `71.42% <ø> (ø)` | |
   | [...tafusion/src/physical\_plan/expressions/lead\_lag.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/429/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9leHByZXNzaW9ucy9sZWFkX2xhZy5ycw==) | `38.29% <38.29%> (ø)` | |
   | [...afusion/src/physical\_plan/expressions/nth\_value.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/429/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9leHByZXNzaW9ucy9udGhfdmFsdWUucnM=) | `70.76% <70.76%> (ø)` | |
   | [datafusion/src/physical\_plan/windows.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/429/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi93aW5kb3dzLnJz) | `78.32% <72.11%> (+4.54%)` | :arrow_up: |
   | [datafusion/src/physical\_plan/mod.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/429/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9tb2QucnM=) | `79.09% <100.00%> (+0.38%)` | :arrow_up: |
   | [datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/429/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi90ZXN0cy9zcWwucnM=) | `99.89% <100.00%> (ø)` | |
   | ... and [2 more](https://codecov.io/gh/apache/arrow-datafusion/pull/429/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/429?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/429?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [321fda4...f763205](https://codecov.io/gh/apache/arrow-datafusion/pull/429?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #429: implement lead and lag built-in window function

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429#issuecomment-849123108


   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/429?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#429](https://codecov.io/gh/apache/arrow-datafusion/pull/429?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (26ca0fe) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/b38282990a3a3ec3c3c3963e96158f879df0ffe2?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (b382829) will **decrease** coverage by `0.01%`.
   > The diff coverage is `65.47%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-datafusion/pull/429/graphs/tree.svg?width=650&height=150&src=pr&token=JXwWBKD3D9&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-datafusion/pull/429?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #429      +/-   ##
   ==========================================
   - Coverage   75.34%   75.33%   -0.02%     
   ==========================================
     Files         147      149       +2     
     Lines       24782    24995     +213     
   ==========================================
   + Hits        18673    18830     +157     
   - Misses       6109     6165      +56     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/429?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [ballista/rust/client/src/columnar\_batch.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/429/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9jbGllbnQvc3JjL2NvbHVtbmFyX2JhdGNoLnJz) | `0.00% <ø> (ø)` | |
   | [ballista/rust/core/src/serde/scheduler/mod.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/429/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YmFsbGlzdGEvcnVzdC9jb3JlL3NyYy9zZXJkZS9zY2hlZHVsZXIvbW9kLnJz) | `14.28% <ø> (ø)` | |
   | [datafusion/src/physical\_plan/expressions/mod.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/429/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9leHByZXNzaW9ucy9tb2QucnM=) | `71.42% <ø> (ø)` | |
   | [...tafusion/src/physical\_plan/expressions/lead\_lag.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/429/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9leHByZXNzaW9ucy9sZWFkX2xhZy5ycw==) | `38.29% <38.29%> (ø)` | |
   | [...afusion/src/physical\_plan/expressions/nth\_value.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/429/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9leHByZXNzaW9ucy9udGhfdmFsdWUucnM=) | `70.76% <70.76%> (ø)` | |
   | [datafusion/src/physical\_plan/windows.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/429/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi93aW5kb3dzLnJz) | `78.32% <72.11%> (+4.54%)` | :arrow_up: |
   | [datafusion/src/physical\_plan/mod.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/429/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9zcmMvcGh5c2ljYWxfcGxhbi9tb2QucnM=) | `79.09% <100.00%> (+0.38%)` | :arrow_up: |
   | [datafusion/tests/sql.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/429/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi90ZXN0cy9zcWwucnM=) | `99.89% <100.00%> (ø)` | |
   | ... and [2 more](https://codecov.io/gh/apache/arrow-datafusion/pull/429/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/429?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/429?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [b382829...26ca0fe](https://codecov.io/gh/apache/arrow-datafusion/pull/429?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Jimexist commented on a change in pull request #429: implement lead and lag built-in window function

Posted by GitBox <gi...@apache.org>.
Jimexist commented on a change in pull request #429:
URL: https://github.com/apache/arrow-datafusion/pull/429#discussion_r639948357



##########
File path: datafusion/src/physical_plan/windows.rs
##########
@@ -421,18 +435,52 @@ async fn compute_window_aggregate(
     let aggregated_mapped = finalize_window_aggregation(&window_accumulators)
         .map_err(DataFusionError::into_arrow_external_error)?;
 
+    let window_shifts = window_accumulators
+        .iter()
+        .map(|acc| acc.window_shift())
+        .collect::<Vec<_>>();
+
     let mut columns: Vec<ArrayRef> = accumulator
         .iter()
         .zip(aggregated_mapped)
-        .map(|(acc, agg)| {
-            Ok(match (acc, agg) {
-                (acc, Some(scalar_value)) if acc.is_empty() => {
+        .zip(window_shifts)
+        .map(|((acc, agg), window_shift)| {
+            Ok(match (acc, agg, window_shift) {
+                (acc, Some(scalar_value), None) if acc.is_empty() => {
                     scalar_value.to_array_of_size(total_num_rows)
                 }
-                (acc, None) if !acc.is_empty() => {
+                (acc, None, window_shift) if !acc.is_empty() => {
                     let vec_array: Vec<&dyn Array> =
                         acc.iter().map(|arc| arc.as_ref()).collect();
-                    concat(&vec_array)?
+                    let arr: ArrayRef = concat(&vec_array)?;
+                    if arr.is_empty() {
+                        arr
+                    } else if arr.len() != total_num_rows {
+                        return Err(DataFusionError::Internal(format!(
+                            "Invalid concatenated array of length {}, expecting {}",
+                            arr.len(),
+                            total_num_rows
+                        )));
+                    } else {
+                        let data_type = arr.data_type();
+                        match window_shift {

Review comment:
       Thanks! That's handy!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org