Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/22 17:35:03 UTC

[GitHub] [arrow] nealrichardson commented on pull request #7668: ARROW-6982: [R] Add bindings for compare and boolean kernels

nealrichardson commented on pull request #7668:
URL: https://github.com/apache/arrow/pull/7668#issuecomment-662587689


   Here's an informal benchmark showing the benefit of pushing all of this work down into Arrow. The "old" expression reproduces what current master does: it calls `as.vector` on every Array before doing any comparison or aggregation. Doing the comparison, filtering, and aggregation in Arrow is about 4.5x faster by median time (47.7ms vs. 213.8ms) and allocates far less R memory (10.2KB vs. 327.6MB), even with eager evaluation:
   
   ```r
   library(arrow)
   tab <- read_parquet("nyc-taxi/2019/06/data.parquet", as_data_frame = FALSE)
   dim(tab)
   ## [1] 6941024      18
   
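   bench::mark(
     # new: compare, combine, filter, and aggregate in Arrow; convert only the final result to R
     new = as.vector(mean(tab$fare_amount[tab$trip_distance > 1 & tab$passenger_count < 4], na.rm = TRUE)),
     # old: pull the columns into R vectors first, then compare and aggregate in R
     old = mean(as.vector(tab$fare_amount[as.vector(tab$trip_distance) > 1 & as.vector(tab$passenger_count) < 4]), na.rm = TRUE)
   )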
   ## # A tibble: 2 x 13
   ##   expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
   ##   <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
   ## 1 new         47.4ms  47.7ms     17.6     10.2KB     1.95     9     1      512ms
   ## 2 old        207.3ms 213.8ms      4.70   327.6MB    12.5      3     8      638ms
   ## # … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
   ```
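
   For anyone without the taxi file on hand, here is a minimal sketch of the same comparison on synthetic data. The data frame, its value ranges, and the names `df`, `n`, and `keep` are made up for illustration and are not part of the benchmark above; it assumes an arrow build that includes the compare and boolean bindings from this PR. It only checks that the two approaches agree and says nothing about timing:

   ```r
   library(arrow)

   # Synthetic stand-in for the taxi data (hypothetical values, not the real file)
   set.seed(42)
   n <- 1e5
   df <- data.frame(
     fare_amount = runif(n, 2.5, 60),
     trip_distance = runif(n, 0, 10),
     passenger_count = sample(1:6, n, replace = TRUE)
   )
   tab <- Table$create(df)

   # "new": comparison, boolean &, filtering, and mean all happen in Arrow;
   # only the final result is converted to an R value
   new <- as.vector(mean(
     tab$fare_amount[tab$trip_distance > 1 & tab$passenger_count < 4],
     na.rm = TRUE
   ))

   # "old": convert the columns to R vectors before comparing and aggregating
   keep <- as.vector(tab$trip_distance) > 1 & as.vector(tab$passenger_count) < 4
   old <- mean(as.vector(tab$fare_amount[keep]), na.rm = TRUE)

   # The two results should match up to floating-point tolerance
   all.equal(new, old)
   ```

   Wrapping the `new` and `old` expressions in `bench::mark()` as above would give timings for this toy table, though they won't be comparable to the numbers from the 6.9M-row file.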


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org