Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/22 17:35:03 UTC
[GitHub] [arrow] nealrichardson commented on pull request #7668: ARROW-6982: [R] Add bindings for compare and boolean kernels
nealrichardson commented on pull request #7668:
URL: https://github.com/apache/arrow/pull/7668#issuecomment-662587689
Here's an informal benchmark that shows the benefit of pushing all of this work down into Arrow. The "old" expression reproduces what current master does: it calls `as.vector` on every Array before doing any comparisons or aggregations. Doing the comparison, filtering, and aggregation in Arrow is 4-5x faster (median 47.7ms vs. 213.8ms), even with eager evaluation:
```r
library(arrow)
tab <- read_parquet("nyc-taxi/2019/06/data.parquet", as_data_frame = FALSE)
dim(tab)
## [1] 6941024 18
bench::mark(
  new = as.vector(mean(tab$fare_amount[tab$trip_distance > 1 & tab$passenger_count < 4], na.rm = TRUE)),
  old = mean(as.vector(tab$fare_amount[as.vector(tab$trip_distance) > 1 & as.vector(tab$passenger_count) < 4]), na.rm = TRUE)
)
## # A tibble: 2 x 13
##   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
##   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
## 1 new         47.4ms  47.7ms     17.6     10.2KB     1.95     9     1      512ms
## 2 old        207.3ms 213.8ms      4.70   327.6MB    12.5      3     8      638ms
## # … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
```
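For anyone without the taxi file handy, here's a minimal sketch of the same two code paths on a small synthetic table, mainly to check that they agree on the result. The column names are borrowed from the benchmark above, but the data is invented and the table is built with `Table$create()`, so this only illustrates the two code paths and says nothing about timing:

```r
library(arrow)

# Synthetic stand-in for the taxi data (values are made up for illustration)
set.seed(1)
n <- 1000
df <- data.frame(
  fare_amount = runif(n, 2, 60),
  trip_distance = runif(n, 0, 10),
  passenger_count = sample(1:6, n, replace = TRUE)
)
tab <- Table$create(df)

# "new": comparison, boolean AND, filter, and mean are all evaluated in Arrow;
# only the final scalar is converted to an R value
new <- as.vector(mean(tab$fare_amount[tab$trip_distance > 1 & tab$passenger_count < 4], na.rm = TRUE))

# "old": each column is pulled into an R vector before any work is done on it
old <- mean(as.vector(tab$fare_amount[as.vector(tab$trip_distance) > 1 & as.vector(tab$passenger_count) < 4]), na.rm = TRUE)

# Compare the two results, allowing for floating-point tolerance
all.equal(new, old)
```

Wrapping the `new` and `old` expressions in `bench::mark()` as above would also have it check that both return equal results, since that is its default behavior.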