Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2022/07/11 19:37:00 UTC
[jira] [Updated] (ARROW-16630) [C++] Proper BottomK support in SinkNode
[ https://issues.apache.org/jira/browse/ARROW-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neal Richardson updated ARROW-16630:
------------------------------------
Description:
BottomK is implemented as TopK on reverse-sorted data. You get the rows you wanted, but they come back in reverse order.
Another consideration: we've been using TopK as a (theoretically) efficient way to do `arrange() %>% head()`, so BottomK would do `arrange() %>% tail()`. But this interacts with the differences in null handling in sorting (ARROW-12063).
Example in R:
{code}
> df <- data.frame(x = c(2, 4, 1, NA, 3))
# as ian says on ARROW-14085, dplyr puts NAs last in sorting, always
> df %>% arrange(x)
   x
1  1
2  2
3  3
4  4
5 NA
> df %>% arrange(desc(x))
   x
1  4
2  3
3  2
4  1
5 NA
# So when you arrange %>% head/tail, you get values based on that:
> df %>% arrange(x) %>% head(1)
  x
1 1
> df %>% arrange(x) %>% tail(1)
   x
5 NA
# We sort like that in arrow too:
> df %>% arrow_table() %>% arrange(x) %>% collect()
   x
1  1
2  2
3  3
4  4
5 NA
> df %>% arrow_table() %>% arrange(desc(x)) %>% collect()
   x
1  4
2  3
3  2
4  1
5 NA
# But since we implement arrange(x) %>% head as TopK(x) and arrange(x) %>% tail as TopK(desc(x)),
# we don't get the same tail value:
> df %>% arrow_table() %>% arrange(x) %>% head(1) %>% collect()
  x
1 1
> df %>% arrow_table() %>% arrange(x) %>% tail(1) %>% collect()
  x
1 4
{code}
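To make the mismatch concrete outside of R, here is a small library-free Python sketch of the two behaviours described above. It is only a model of the semantics, not Arrow's code: the function names are made up for illustration, None stands in for NA, and the last function is one hypothetical way a null-aware BottomK could behave, not a description of any planned fix.

```python
def sort_nulls_last(values, descending=False):
    """Sort values, always placing None last (the dplyr arrange() convention)."""
    present = sorted((v for v in values if v is not None), reverse=descending)
    nulls = [v for v in values if v is None]
    return present + nulls


def tail_via_sort(values, k):
    """Reference behaviour: full ascending sort, then keep the last k rows."""
    return sort_nulls_last(values)[-k:]


def tail_via_reversed_topk(values, k):
    """Model of the behaviour in this issue: first k rows of the descending sort."""
    return sort_nulls_last(values, descending=True)[:k]


def bottom_k_null_aware(values, k):
    """Hypothetical alternative: descending sort with nulls FIRST, take k, reverse."""
    present = sorted((v for v in values if v is not None), reverse=True)
    nulls = [v for v in values if v is None]
    return list(reversed((nulls + present)[:k]))


x = [2, 4, 1, None, 3]
print(tail_via_sort(x, 3))           # [3, 4, None]
print(tail_via_reversed_topk(x, 3))  # [4, 3, 2]
print(bottom_k_null_aware(x, 3))     # [3, 4, None]
```

With k = 3 both problems show up at once: the reversed-TopK model returns its rows in descending order, and it drops the null that a plain sort-then-tail would keep.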
was: BottomK is implemented as TopK on reverse-sorted data. You get the rows you wanted, but the problem is that they're in reversed order.
> [C++] Proper BottomK support in SinkNode
> ----------------------------------------
>
> Key: ARROW-16630
> URL: https://issues.apache.org/jira/browse/ARROW-16630
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Neal Richardson
> Priority: Major
> Labels: query-engine
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)