You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2022/07/11 19:37:00 UTC
[jira] [Updated] (ARROW-16630) [C++] Proper BottomK support in SinkNode

     [ https://issues.apache.org/jira/browse/ARROW-16630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-16630:
------------------------------------
    Description: 
BottomK is implemented as TopK on reverse-sorted data. You get the rows you wanted, but the problem is that they're in reversed order.

Other consideration: we've been using TopK as a (theoretically) efficient way to do `arrange() %>% head()`, and so BottomK would do `arrange() %>% tail()`. But this intersects with differences in null-handling in sorting (ARROW-12063). 

Example in R:

{code}
> df <- data.frame(x = c(2, 4, 1, NA, 3))
# as ian says on ARROW-14085, dplyr puts NAs last in sorting, always
> df %>% arrange(x)
   x
1  1
2  2
3  3
4  4
5 NA
> df %>% arrange(desc(x))
   x
1  4
2  3
3  2
4  1
5 NA

# So when you arrange %>% head/tail, you get values based on that:
> df %>% arrange(x) %>% head(1)
  x
1 1
> df %>% arrange(x) %>% tail(1)
   x
5 NA

# We sort like that in arrow too:
> df %>% arrow_table() %>% arrange(x) %>% collect()
   x
1  1
2  2
3  3
4  4
5 NA
> df %>% arrow_table() %>% arrange(desc(x)) %>% collect()
   x
1  4
2  3
3  2
4  1
5 NA

# But since we implement arrange(x) %>% head as TopK(x) and arrange(x) %>% tail as TopK(desc(x)),
# we don't get the same tail value:
> df %>% arrow_table() %>% arrange(x) %>% head(1) %>% collect()
  x
1 1
> df %>% arrow_table() %>% arrange(x) %>% tail(1) %>% collect()
  x
1 4
{code}

  was:BottomK is implemented as TopK on reverse-sorted data. You get the rows you wanted, but the problem is that they're in reversed order.


> [C++] Proper BottomK support in SinkNode
> ----------------------------------------
>
>                 Key: ARROW-16630
>                 URL: https://issues.apache.org/jira/browse/ARROW-16630
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Neal Richardson
>            Priority: Major
>              Labels: query-engine
>
> BottomK is implemented as TopK on reverse-sorted data. You get the rows you wanted, but the problem is that they're in reversed order.
> Other consideration: we've been using TopK as a (theoretically) efficient way to do `arrange() %>% head()`, and so BottomK would do `arrange() %>% tail()`. But this intersects with differences in null-handling in sorting (ARROW-12063). 
> Example in R:
> {code}
> > df <- data.frame(x = c(2, 4, 1, NA, 3))
> # as ian says on ARROW-14085, dplyr puts NAs last in sorting, always
> > df %>% arrange(x)
>    x
> 1  1
> 2  2
> 3  3
> 4  4
> 5 NA
> > df %>% arrange(desc(x))
>    x
> 1  4
> 2  3
> 3  2
> 4  1
> 5 NA
> # So when you arrange %>% head/tail, you get values based on that:
> > df %>% arrange(x) %>% head(1)
>   x
> 1 1
> > df %>% arrange(x) %>% tail(1)
>    x
> 5 NA
> # We sort like that in arrow too:
> > df %>% arrow_table() %>% arrange(x) %>% collect()
>    x
> 1  1
> 2  2
> 3  3
> 4  4
> 5 NA
> > df %>% arrow_table() %>% arrange(desc(x)) %>% collect()
>    x
> 1  4
> 2  3
> 3  2
> 4  1
> 5 NA
> # But since we implement arrange(x) %>% head as TopK(x) and arrange(x) %>% tail as TopK(desc(x)),
> # we don't get the same tail value:
> > df %>% arrow_table() %>% arrange(x) %>% head(1) %>% collect()
>   x
> 1 1
> > df %>% arrow_table() %>% arrange(x) %>% tail(1) %>% collect()
>   x
> 1 4
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)