Posted to jira@arrow.apache.org by "Sam Albers (Jira)" <ji...@apache.org> on 2021/03/10 19:25:00 UTC
[jira] [Created] (ARROW-11925) Add `between` method for arrow_dplyr_query
Sam Albers created ARROW-11925:
----------------------------------
Summary: Add `between` method for arrow_dplyr_query
Key: ARROW-11925
URL: https://issues.apache.org/jira/browse/ARROW-11925
Project: Apache Arrow
Issue Type: New Feature
Components: R
Reporter: Sam Albers
Would you consider a PR to add a `between` method for `arrow_dplyr_query` objects? Even an implementation written directly in R takes advantage of Arrow's speed. Here is what I am thinking:
Typical usage of `between`:
{code:java}
library(dplyr)
library(arrow)
iris %>% filter(between(Petal.Length, 1, 1.1))
{code}
Here is a mocked up version of the method:
{code:java}
between_mock <- function(x, left, right) {
  # Both bounds must be scalars
  if (length(left) != 1) {
    rlang::abort("`left` must be length 1")
  }
  if (length(right) != 1) {
    rlang::abort("`right` must be length 1")
  }
  # Inclusive on both ends
  x >= left & x <= right
}
{code}
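As a quick sanity check, the same logic can be exercised on a plain numeric vector. This is only a minimal sketch under a couple of assumptions: `between_check` is a hypothetical name, and base R `stop()` stands in for `rlang::abort()` so the snippet has no dependencies.

```r
# Hypothetical dependency-free variant of the mock above.
between_check <- function(x, left, right) {
  if (length(left) != 1) stop("`left` must be length 1")
  if (length(right) != 1) stop("`right` must be length 1")
  # Inclusive on both ends, built only from comparison and logical operators
  x >= left & x <= right
}

vals <- c(0.5, 1, 1.05, 1.1, 2)
res <- between_check(vals, 1, 1.1)
print(res)  # FALSE TRUE TRUE TRUE FALSE
```

Both bounds are inclusive, so the values equal to `left` and `right` are kept.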
I think that because `dplyr` implements `between` in C++ for efficiency, it doesn't work out of the box on Arrow Datasets:
{code:java}
open_dataset("nyc-taxi", partitioning = "year") %>%
  filter(year == 2014) %>%
  select(year, fare_amount) %>%
  filter(between(fare_amount, 10, 11)) %>%
  collect()
Error: Filter expression not supported for Arrow Datasets: between(fare_amount, 10, 11)
Call collect() first to pull data into R.
In addition: Warning message:
between() called on numeric vector with S3 class
Backtrace:
x
1. +-[ `%>%`(...) ]
2. +-[ dplyr::collect(...) ]
3. +-[ dplyr::filter(...) ]
4. \-arrow:::filter.arrow_dplyr_query(...)
{code}
But even my simple implementation works fine:
{code:java}
open_dataset("nyc-taxi", partitioning = "year") %>%
  filter(year == 2014) %>%
  select(year, fare_amount) %>%
  filter(between_mock(fare_amount, 10, 11)) %>%
  collect()
{code}
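One plausible reason the mock works is that it is written entirely with `>=`, `<=`, and `&`, operators that can dispatch on the class of their arguments and so build up a filter expression instead of evaluating immediately. The toy below illustrates that general R mechanism only. `lazy_field`, `as_repr`, `between_sketch`, and the `lazy_expr` class are hypothetical names for illustration, not arrow's actual classes or API.

```r
# Toy stand-in for a lazy column reference; NOT arrow's real implementation.
lazy_field <- function(name) structure(list(repr = name), class = "lazy_expr")

# Render either a lazy expression or a plain R value as text.
as_repr <- function(x) if (inherits(x, "lazy_expr")) x$repr else deparse(x)

# S3 group generic: binary operators build a new expression object
# instead of computing a result.
Ops.lazy_expr <- function(e1, e2) {
  structure(
    list(repr = paste0("(", as_repr(e1), " ", .Generic, " ", as_repr(e2), ")")),
    class = "lazy_expr"
  )
}

# Same body as the mock, minus the length checks.
between_sketch <- function(x, left, right) x >= left & x <= right

expr <- between_sketch(lazy_field("fare_amount"), 10, 11)
print(expr$repr)  # "((fare_amount >= 10) & (fare_amount <= 11))"
```

A function that hands its argument to compiled code has no such hook, which would be consistent with the error shown earlier.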