Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2021/03/15 20:52:00 UTC

[jira] [Resolved] (ARROW-11925) [R] Add `between` method for arrow_dplyr_query

     [ https://issues.apache.org/jira/browse/ARROW-11925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson resolved ARROW-11925.
-------------------------------------
    Resolution: Fixed

Issue resolved by pull request 9674
[https://github.com/apache/arrow/pull/9674]

> [R] Add `between` method for arrow_dplyr_query
> ----------------------------------------------
>
>                 Key: ARROW-11925
>                 URL: https://issues.apache.org/jira/browse/ARROW-11925
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: R
>            Reporter: Sam Albers
>            Assignee: Sam Albers
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Would you consider a PR to add a `between` method for `arrow_dplyr_query` objects? Even a version implemented directly in R would still benefit from Arrow's speed. Here is what I am thinking.
> Typical usage of `between`:
>  
> {code:java}
> library(dplyr)
> library(arrow)
> iris %>% filter(between(Petal.Length, 1, 1.1)){code}
>  
> Here is a mocked-up version of the method:
>  
> {code:java}
> between_mock <- function(x, left, right) {
>   if (length(left) != 1) {
>     rlang::abort("`left` must be length 1")
>   }
>   if (length(right) != 1) {
>     rlang::abort("`right` must be length 1")
>   }
>   x >= left & x <= right
> }{code}
> I think that because `dplyr` implements `between` in C++ for efficiency, it doesn't work out of the box on an Arrow Dataset query:
> {code:java}
> open_dataset("nyc-taxi", partitioning = "year") %>% 
>  filter(year == 2014) %>% 
>  select(year, fare_amount) %>% 
>  filter(between(fare_amount, 10, 11)) %>% 
>  collect() 
> Error: Filter expression not supported for Arrow Datasets: between(fare_amount, 10, 11)
> Call collect() first to pull data into R.
> In addition: Warning message:
>  between() called on numeric vector with S3 class 
> Backtrace:
>  x
> 1. +-[ `%>%`(...) ]
> 2. +-[ dplyr::collect(...) ]
> 3. +-[ dplyr::filter(...) ]
> 4. \-arrow:::filter.arrow_dplyr_query(...){code}
> But even my simple implementation works fine:
> {code:java}
> open_dataset("nyc-taxi", partitioning = "year") %>% 
>  filter(year == 2014) %>% 
>  select(year, fare_amount) %>% 
>  filter(between_mock(fare_amount, 10, 11)) %>% 
>  collect() {code}
>  
>  
>  
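A plausible reason the mocked-up version works is that it is built only from `>=`, `<=` and `&`, so it runs on any object that defines those operators. The sketch below illustrates the idea in base R, without arrow. `lazy_col` and `between_sketch` are hypothetical names invented for this illustration, `stop()` stands in for `rlang::abort()`, and the toy class only mimics the general idea of a deferred expression; it is not how arrow represents its expressions.

```r
# Hypothetical stand-in for a lazily evaluated column reference (not arrow's API).
lazy_col <- function(expr) structure(list(expr = expr), class = "lazy_col")

# S3 group generic for binary operators: record the comparison as text
# instead of computing a value. Unary operators are not handled.
Ops.lazy_col <- function(e1, e2) {
  as_text <- function(e) if (inherits(e, "lazy_col")) e$expr else format(e)
  lazy_col(paste0("(", as_text(e1), " ", .Generic, " ", as_text(e2), ")"))
}

# Same logic as the mock in the issue, using base R stop() for the checks.
between_sketch <- function(x, left, right) {
  if (length(left) != 1) stop("`left` must be length 1")
  if (length(right) != 1) stop("`right` must be length 1")
  x >= left & x <= right
}

# On an ordinary numeric vector the result is a logical vector.
in_range <- between_sketch(c(0.5, 1, 1.05, 2), 1, 1.1)

# On the toy lazy object the same function builds a deferred expression.
query <- between_sketch(lazy_col("fare_amount"), 10, 11)
print(in_range)
print(query$expr)
```

Because the function never inspects the values of `x` itself, the same body serves both the eager and the deferred case, which is consistent with the behavior reported in the issue.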


