You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/28 01:52:38 UTC
[GitHub] [arrow-datafusion] Jimexist opened a new pull request #638: add integration tests for rank, dense_rank
Jimexist opened a new pull request #638:
URL: https://github.com/apache/arrow-datafusion/pull/638
# Which issue does this PR close?
add integration tests for rank, dense_rank
Closes follow ups on #555
# Rationale for this change
<!--
Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
-->
# What changes are included in this PR?
<!--
There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
-->
# Are there any user-facing changes?
<!--
If there are user-facing changes then we may require documentation to be updated before approving the PR.
-->
<!--
If there are any breaking changes to public APIs, please add the `api change` label.
-->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on a change in pull request #638: add integration tests for rank, dense_rank, fix last_value evaluation with rank
Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #638:
URL: https://github.com/apache/arrow-datafusion/pull/638#discussion_r660878629
##########
File path: datafusion/src/physical_plan/expressions/nth_value.rs
##########
@@ -138,21 +140,56 @@ pub(crate) struct NthValueEvaluator {
}
impl PartitionEvaluator for NthValueEvaluator {
- fn evaluate_partition(&self, partition: Range<usize>) -> Result<ArrayRef> {
- let value = &self.values[0];
+ fn include_rank(&self) -> bool {
+ true
+ }
+
+ fn evaluate_partition(&self, _partition: Range<usize>) -> Result<ArrayRef> {
+ unreachable!("first, last, and nth_value evaluation must be called with evaluate_partition_with_rank")
+ }
+
+ fn evaluate_partition_with_rank(
+ &self,
+ partition: Range<usize>,
+ ranks_in_partition: &[Range<usize>],
+ ) -> Result<ArrayRef> {
+ let arr = &self.values[0];
let num_rows = partition.end - partition.start;
- let value = value.slice(partition.start, num_rows);
- let index: usize = match self.kind {
- NthValueKind::First => 0,
- NthValueKind::Last => (num_rows as usize) - 1,
- NthValueKind::Nth(n) => (n as usize) - 1,
- };
- Ok(if index >= num_rows {
- new_null_array(value.data_type(), num_rows)
- } else {
- let value = ScalarValue::try_from_array(&value, index)?;
- value.to_array_of_size(num_rows)
- })
+ match self.kind {
+ NthValueKind::First => {
+ let value = ScalarValue::try_from_array(arr, partition.start)?;
+ Ok(value.to_array_of_size(num_rows))
+ }
+ NthValueKind::Last => {
+ // because the default window frame is between unbounded preceding and current
+ // row with peer evaluation, hence the last rows expands until the end of the peers
+ let values = ranks_in_partition
+ .iter()
+ .map(|range| {
+ let len = range.end - range.start;
+ let value = ScalarValue::try_from_array(arr, range.end - 1)?;
+ Ok(iter::repeat(value).take(len))
+ })
+ .collect::<Result<Vec<_>>>()?
+ .into_iter()
+ .flatten();
+ ScalarValue::iter_to_array(values)
+ }
+ NthValueKind::Nth(n) => {
+ let index = (n as usize) - 1;
+ if index >= num_rows {
+ Ok(new_null_array(arr.data_type(), num_rows))
+ } else {
+ let value =
+ ScalarValue::try_from_array(arr, partition.start + index)?;
+ let arr = value.to_array_of_size(num_rows);
+ // because the default window frame is between unbounded preceding and current
+ // row, hence the shift because for values with indices < index they should be
+ // null. This changes when window frames other than default is implemented
+ shift(arr.as_ref(), index as i64).map_err(DataFusionError::ArrowError)
Review comment:
👍
##########
File path: datafusion/src/physical_plan/expressions/nth_value.rs
##########
@@ -138,21 +140,56 @@ pub(crate) struct NthValueEvaluator {
}
impl PartitionEvaluator for NthValueEvaluator {
- fn evaluate_partition(&self, partition: Range<usize>) -> Result<ArrayRef> {
- let value = &self.values[0];
+ fn include_rank(&self) -> bool {
+ true
+ }
+
+ fn evaluate_partition(&self, _partition: Range<usize>) -> Result<ArrayRef> {
+ unreachable!("first, last, and nth_value evaluation must be called with evaluate_partition_with_rank")
+ }
+
+ fn evaluate_partition_with_rank(
+ &self,
+ partition: Range<usize>,
+ ranks_in_partition: &[Range<usize>],
+ ) -> Result<ArrayRef> {
+ let arr = &self.values[0];
let num_rows = partition.end - partition.start;
- let value = value.slice(partition.start, num_rows);
- let index: usize = match self.kind {
- NthValueKind::First => 0,
- NthValueKind::Last => (num_rows as usize) - 1,
- NthValueKind::Nth(n) => (n as usize) - 1,
- };
- Ok(if index >= num_rows {
- new_null_array(value.data_type(), num_rows)
- } else {
- let value = ScalarValue::try_from_array(&value, index)?;
- value.to_array_of_size(num_rows)
- })
+ match self.kind {
+ NthValueKind::First => {
+ let value = ScalarValue::try_from_array(arr, partition.start)?;
+ Ok(value.to_array_of_size(num_rows))
+ }
+ NthValueKind::Last => {
+ // because the default window frame is between unbounded preceding and current
+ // row with peer evaluation, hence the last rows expands until the end of the peers
+ let values = ranks_in_partition
+ .iter()
+ .map(|range| {
+ let len = range.end - range.start;
+ let value = ScalarValue::try_from_array(arr, range.end - 1)?;
Review comment:
this is very cool
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] Jimexist commented on a change in pull request #638: add integration tests for rank, dense_rank, fix last_value evaluation with rank
Posted by GitBox <gi...@apache.org>.
Jimexist commented on a change in pull request #638:
URL: https://github.com/apache/arrow-datafusion/pull/638#discussion_r660227050
##########
File path: datafusion/tests/sql.rs
##########
@@ -952,15 +952,8 @@ async fn csv_query_window_with_order_by() -> Result<()> {
let actual = execute(&mut ctx, sql).await;
let expected = vec![
Review comment:
```
-[ RECORD 1 ]-----------------------
c9 | 28774375
sum | 61035129
avg | 61035129.000000000000
count | 1
max | 61035129
min | 61035129
first_value | 61035129
last_value | 61035129
nth_value | ¤
-[ RECORD 2 ]-----------------------
c9 | 63044568
sum | -47938237
avg | -23969118.500000000000
count | 2
max | 61035129
min | -108973366
first_value | 61035129
last_value | -108973366
nth_value | -108973366
-[ RECORD 3 ]-----------------------
c9 | 141047417
sum | 575165281
avg | 191721760.33333333
count | 3
max | 623103518
min | -108973366
first_value | 61035129
last_value | 623103518
nth_value | -108973366
-[ RECORD 4 ]-----------------------
c9 | 141680161
sum | -1352462829
avg | -338115707.25000000
count | 4
max | 623103518
min | -1927628110
first_value | 61035129
last_value | -1927628110
nth_value | -108973366
-[ RECORD 5 ]-----------------------
c9 | 145294611
sum | -3251637940
avg | -650327588.00000000
count | 5
max | 623103518
min | -1927628110
first_value | 61035129
last_value | -1899175111
nth_value | -108973366
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] Jimexist commented on pull request #638: add integration tests for rank, dense_rank, fix last_value evaluation with rank
Posted by GitBox <gi...@apache.org>.
Jimexist commented on pull request #638:
URL: https://github.com/apache/arrow-datafusion/pull/638#issuecomment-870594602
@alamb and @Dandandan this pull request is ready now
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] Jimexist commented on a change in pull request #638: add integration tests for rank, dense_rank, fix last_value evaluation with rank
Posted by GitBox <gi...@apache.org>.
Jimexist commented on a change in pull request #638:
URL: https://github.com/apache/arrow-datafusion/pull/638#discussion_r660227050
##########
File path: datafusion/tests/sql.rs
##########
@@ -952,15 +952,8 @@ async fn csv_query_window_with_order_by() -> Result<()> {
let actual = execute(&mut ctx, sql).await;
let expected = vec![
Review comment:
```
-[ RECORD 1 ]-----------------------
c9 | 28774375
sum | 61035129
avg | 61035129.000000000000
count | 1
max | 61035129
min | 61035129
first_value | 61035129
last_value | 61035129
nth_value | ¤
-[ RECORD 2 ]-----------------------
c9 | 63044568
sum | -47938237
avg | -23969118.500000000000
count | 2
max | 61035129
min | -108973366
first_value | 61035129
last_value | -108973366
nth_value | -108973366
-[ RECORD 3 ]-----------------------
c9 | 141047417
sum | 575165281
avg | 191721760.33333333
count | 3
max | 623103518
min | -108973366
first_value | 61035129
last_value | 623103518
nth_value | -108973366
-[ RECORD 4 ]-----------------------
c9 | 141680161
sum | -1352462829
avg | -338115707.25000000
count | 4
max | 623103518
min | -1927628110
first_value | 61035129
last_value | -1927628110
nth_value | -108973366
-[ RECORD 5 ]-----------------------
c9 | 145294611
sum | -3251637940
avg | -650327588.00000000
count | 5
max | 623103518
min | -1927628110
first_value | 61035129
last_value | -1899175111
nth_value | -108973366
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb merged pull request #638: add integration tests for rank, dense_rank, fix last_value evaluation with rank
Posted by GitBox <gi...@apache.org>.
alamb merged pull request #638:
URL: https://github.com/apache/arrow-datafusion/pull/638
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org