You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/28 01:52:38 UTC

[GitHub] [arrow-datafusion] Jimexist opened a new pull request #638: add integration tests for rank, dense_rank

Jimexist opened a new pull request #638:
URL: https://github.com/apache/arrow-datafusion/pull/638


   # Which issue does this PR close?
   
   add integration tests for rank, dense_rank
   
   Closes follow ups on #555 
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #638: add integration tests for rank, dense_rank, fix last_value evaluation with rank

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #638:
URL: https://github.com/apache/arrow-datafusion/pull/638#discussion_r660878629



##########
File path: datafusion/src/physical_plan/expressions/nth_value.rs
##########
@@ -138,21 +140,56 @@ pub(crate) struct NthValueEvaluator {
 }
 
 impl PartitionEvaluator for NthValueEvaluator {
-    fn evaluate_partition(&self, partition: Range<usize>) -> Result<ArrayRef> {
-        let value = &self.values[0];
+    fn include_rank(&self) -> bool {
+        true
+    }
+
+    fn evaluate_partition(&self, _partition: Range<usize>) -> Result<ArrayRef> {
+        unreachable!("first, last, and nth_value evaluation must be called with evaluate_partition_with_rank")
+    }
+
+    fn evaluate_partition_with_rank(
+        &self,
+        partition: Range<usize>,
+        ranks_in_partition: &[Range<usize>],
+    ) -> Result<ArrayRef> {
+        let arr = &self.values[0];
         let num_rows = partition.end - partition.start;
-        let value = value.slice(partition.start, num_rows);
-        let index: usize = match self.kind {
-            NthValueKind::First => 0,
-            NthValueKind::Last => (num_rows as usize) - 1,
-            NthValueKind::Nth(n) => (n as usize) - 1,
-        };
-        Ok(if index >= num_rows {
-            new_null_array(value.data_type(), num_rows)
-        } else {
-            let value = ScalarValue::try_from_array(&value, index)?;
-            value.to_array_of_size(num_rows)
-        })
+        match self.kind {
+            NthValueKind::First => {
+                let value = ScalarValue::try_from_array(arr, partition.start)?;
+                Ok(value.to_array_of_size(num_rows))
+            }
+            NthValueKind::Last => {
+                // because the default window frame is between unbounded preceding and current
+                // row with peer evaluation, hence the last rows expands until the end of the peers
+                let values = ranks_in_partition
+                    .iter()
+                    .map(|range| {
+                        let len = range.end - range.start;
+                        let value = ScalarValue::try_from_array(arr, range.end - 1)?;
+                        Ok(iter::repeat(value).take(len))
+                    })
+                    .collect::<Result<Vec<_>>>()?
+                    .into_iter()
+                    .flatten();
+                ScalarValue::iter_to_array(values)
+            }
+            NthValueKind::Nth(n) => {
+                let index = (n as usize) - 1;
+                if index >= num_rows {
+                    Ok(new_null_array(arr.data_type(), num_rows))
+                } else {
+                    let value =
+                        ScalarValue::try_from_array(arr, partition.start + index)?;
+                    let arr = value.to_array_of_size(num_rows);
+                    // because the default window frame is between unbounded preceding and current
+                    // row, hence the shift because for values with indices < index they should be
+                    // null. This changes when window frames other than default is implemented
+                    shift(arr.as_ref(), index as i64).map_err(DataFusionError::ArrowError)

Review comment:
       👍 

##########
File path: datafusion/src/physical_plan/expressions/nth_value.rs
##########
@@ -138,21 +140,56 @@ pub(crate) struct NthValueEvaluator {
 }
 
 impl PartitionEvaluator for NthValueEvaluator {
-    fn evaluate_partition(&self, partition: Range<usize>) -> Result<ArrayRef> {
-        let value = &self.values[0];
+    fn include_rank(&self) -> bool {
+        true
+    }
+
+    fn evaluate_partition(&self, _partition: Range<usize>) -> Result<ArrayRef> {
+        unreachable!("first, last, and nth_value evaluation must be called with evaluate_partition_with_rank")
+    }
+
+    fn evaluate_partition_with_rank(
+        &self,
+        partition: Range<usize>,
+        ranks_in_partition: &[Range<usize>],
+    ) -> Result<ArrayRef> {
+        let arr = &self.values[0];
         let num_rows = partition.end - partition.start;
-        let value = value.slice(partition.start, num_rows);
-        let index: usize = match self.kind {
-            NthValueKind::First => 0,
-            NthValueKind::Last => (num_rows as usize) - 1,
-            NthValueKind::Nth(n) => (n as usize) - 1,
-        };
-        Ok(if index >= num_rows {
-            new_null_array(value.data_type(), num_rows)
-        } else {
-            let value = ScalarValue::try_from_array(&value, index)?;
-            value.to_array_of_size(num_rows)
-        })
+        match self.kind {
+            NthValueKind::First => {
+                let value = ScalarValue::try_from_array(arr, partition.start)?;
+                Ok(value.to_array_of_size(num_rows))
+            }
+            NthValueKind::Last => {
+                // because the default window frame is between unbounded preceding and current
+                // row with peer evaluation, hence the last rows expands until the end of the peers
+                let values = ranks_in_partition
+                    .iter()
+                    .map(|range| {
+                        let len = range.end - range.start;
+                        let value = ScalarValue::try_from_array(arr, range.end - 1)?;

Review comment:
       this is very cool




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Jimexist commented on a change in pull request #638: add integration tests for rank, dense_rank, fix last_value evaluation with rank

Posted by GitBox <gi...@apache.org>.
Jimexist commented on a change in pull request #638:
URL: https://github.com/apache/arrow-datafusion/pull/638#discussion_r660227050



##########
File path: datafusion/tests/sql.rs
##########
@@ -952,15 +952,8 @@ async fn csv_query_window_with_order_by() -> Result<()> {
     let actual = execute(&mut ctx, sql).await;
     let expected = vec![

Review comment:
       ```
   -[ RECORD 1 ]-----------------------
   c9          | 28774375
   sum         | 61035129
   avg         | 61035129.000000000000
   count       | 1
   max         | 61035129
   min         | 61035129
   first_value | 61035129
   last_value  | 61035129
   nth_value   | ¤
   -[ RECORD 2 ]-----------------------
   c9          | 63044568
   sum         | -47938237
   avg         | -23969118.500000000000
   count       | 2
   max         | 61035129
   min         | -108973366
   first_value | 61035129
   last_value  | -108973366
   nth_value   | -108973366
   -[ RECORD 3 ]-----------------------
   c9          | 141047417
   sum         | 575165281
   avg         | 191721760.33333333
   count       | 3
   max         | 623103518
   min         | -108973366
   first_value | 61035129
   last_value  | 623103518
   nth_value   | -108973366
   -[ RECORD 4 ]-----------------------
   c9          | 141680161
   sum         | -1352462829
   avg         | -338115707.25000000
   count       | 4
   max         | 623103518
   min         | -1927628110
   first_value | 61035129
   last_value  | -1927628110
   nth_value   | -108973366
   -[ RECORD 5 ]-----------------------
   c9          | 145294611
   sum         | -3251637940
   avg         | -650327588.00000000
   count       | 5
   max         | 623103518
   min         | -1927628110
   first_value | 61035129
   last_value  | -1899175111
   nth_value   | -108973366
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Jimexist commented on pull request #638: add integration tests for rank, dense_rank, fix last_value evaluation with rank

Posted by GitBox <gi...@apache.org>.
Jimexist commented on pull request #638:
URL: https://github.com/apache/arrow-datafusion/pull/638#issuecomment-870594602


   @alamb and @Dandandan this pull request is ready now


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Jimexist commented on a change in pull request #638: add integration tests for rank, dense_rank, fix last_value evaluation with rank

Posted by GitBox <gi...@apache.org>.
Jimexist commented on a change in pull request #638:
URL: https://github.com/apache/arrow-datafusion/pull/638#discussion_r660227050



##########
File path: datafusion/tests/sql.rs
##########
@@ -952,15 +952,8 @@ async fn csv_query_window_with_order_by() -> Result<()> {
     let actual = execute(&mut ctx, sql).await;
     let expected = vec![

Review comment:
       ```
   -[ RECORD 1 ]-----------------------
   c9          | 28774375
   sum         | 61035129
   avg         | 61035129.000000000000
   count       | 1
   max         | 61035129
   min         | 61035129
   first_value | 61035129
   last_value  | 61035129
   nth_value   | ¤
   -[ RECORD 2 ]-----------------------
   c9          | 63044568
   sum         | -47938237
   avg         | -23969118.500000000000
   count       | 2
   max         | 61035129
   min         | -108973366
   first_value | 61035129
   last_value  | -108973366
   nth_value   | -108973366
   -[ RECORD 3 ]-----------------------
   c9          | 141047417
   sum         | 575165281
   avg         | 191721760.33333333
   count       | 3
   max         | 623103518
   min         | -108973366
   first_value | 61035129
   last_value  | 623103518
   nth_value   | -108973366
   -[ RECORD 4 ]-----------------------
   c9          | 141680161
   sum         | -1352462829
   avg         | -338115707.25000000
   count       | 4
   max         | 623103518
   min         | -1927628110
   first_value | 61035129
   last_value  | -1927628110
   nth_value   | -108973366
   -[ RECORD 5 ]-----------------------
   c9          | 145294611
   sum         | -3251637940
   avg         | -650327588.00000000
   count       | 5
   max         | 623103518
   min         | -1927628110
   first_value | 61035129
   last_value  | -1899175111
   nth_value   | -108973366
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb merged pull request #638: add integration tests for rank, dense_rank, fix last_value evaluation with rank

Posted by GitBox <gi...@apache.org>.
alamb merged pull request #638:
URL: https://github.com/apache/arrow-datafusion/pull/638


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org