Posted to issues@arrow.apache.org by "mikerspencer (via GitHub)" <gi...@apache.org> on 2023/11/30 11:15:11 UTC

[I] R: altrep data type slows down evaluation [arrow]

mikerspencer opened a new issue, #39004:
URL: https://github.com/apache/arrow/issues/39004

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   When evaluating operations such as checking for NA, the ALTREP data type slows the calculation by roughly a factor of four. Tested with arrow 10, 12 and 14 on Ubuntu.
   
   ``` r
   library(arrow)
   library(dplyr)
   
   # generate data
   x = runif(29500000) * 10
   d = data.frame(cv = x)
   write_dataset(d, "/tmp/data.arrow")
   # then read back
   df = open_dataset("/tmp/data.arrow/") %>% select(cv) %>% collect()
   x = df$cv
   y = x + 0
   
   identical(x, y)
   microbenchmark::microbenchmark(x={sum(is.na(x))}, y={sum(is.na(y))})
   ```
   
   Results:
   
   Unit: milliseconds
   | expr | min | lq | mean | median | uq | max | neval |
   |---|---|---|---|---|---|---|---|
   | x | 291.8 | 302.2 | 348.8 | 310.2 | 348.8 | 754.8 | 100 |
   | y | 85.3 | 87.2 | 108.8 | 89.3 | 133.4 | 225.4 | 100 |
   
   With thanks to Barry for the reprex https://mastodon.scot/@geospacedman@mastodon.social/111450704657241188
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [R] altrep data type slows down evaluation [arrow]

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.

paleolimbot commented on issue #39004:
URL: https://github.com/apache/arrow/issues/39004#issuecomment-1875615355

   I checked to make sure that nothing unexpected is happening (e.g., we had an issue before where we were accidentally materializing the entire array on each call to `Elt()`), and nothing seems to be amiss: the underlying implementation calls `ISNAN(REAL_ELT(x))` (or similar, I didn't check) many times for ALTREP objects. For us, that's very slow.
   
   A better implementation might call `REAL_GET_REGION()`. If it did, the ALTREP implementation would be slower but not nearly as bad as extracting each element individually.
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   
   x <- runif(29500000) * 10
   x_altrep <- as.vector(as_chunked_array(x))
   .Internal(inspect(x_altrep))
   #> @1064c4180 14 REALSXP g0c0 [REF(65535)] arrow::array_dbl_vector<0x13570c5b8, double, 1 chunks, 0 nulls> len=29500000
   
   # Probably a better implementation than base R's
   cpp11::cpp_function("
   cpp11::logicals is_na2(cpp11::doubles x) {
       int region_size = 1024;
       R_xlen_t n = x.size();
       cpp11::writable::logicals out(n);
       cpp11::writable::doubles buf_shelter(region_size);
       double* buf = REAL(buf_shelter);
       for (R_xlen_t i = 0; i < n; i++) {
         if ((i % region_size) == 0) {
           REAL_GET_REGION(x, i, region_size, buf);
         }
         out[i] = ISNAN(buf[i % region_size]);
       }
       return out;
   }                    
   ")
   
   bench::mark(
     is.na(x),
     is.na(x_altrep),
     is_na2(x),
     is_na2(x_altrep)
   )
   #> # A tibble: 4 × 6
   #>   expression            min   median `itr/sec` mem_alloc `gc/sec`
   #>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
   #> 1 is.na(x)           23.8ms   24.1ms     40.6      113MB    14.5 
   #> 2 is.na(x_altrep)   315.7ms  317.6ms      3.15     113MB     0   
   #> 3 is_na2(x)          60.8ms   61.4ms     16.3      113MB     5.44
   #> 4 is_na2(x_altrep)     62ms   62.3ms     16.0      113MB     5.35
   
   # Make sure we didn't materialize
   .Internal(inspect(x_altrep))
   #> @1064c4180 14 REALSXP g1c0 [MARK,REF(65535)] arrow::array_dbl_vector<0x13570c5b8, double, 1 chunks, 0 nulls> len=29500000
   ```
   
   <sup>Created on 2024-01-03 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
   
   It does raise the question of whether enabling ALTREP by default is worth the trouble.



Re: [I] [R] altrep data type slows down evaluation [arrow]

Posted by "mikerspencer (via GitHub)" <gi...@apache.org>.

mikerspencer commented on issue #39004:
URL: https://github.com/apache/arrow/issues/39004#issuecomment-1836538963

   That's great, thanks! I get a slightly quicker response now from the arrow var:
   
   Unit: milliseconds
   | expr | min | lq | mean | median | uq | max | neval |
   |---|---|---|---|---|---|---|---|
   | x | 87.89557 | 96.76203 | 121.5635 | 117.6893 | 133.8556 | 252.2526 | 100 |
   | y | 88.43664 | 101.78672 | 129.5208 | 120.7926 | 151.2396 | 281.5310 | 100 |
   
   Getting into hardware, I suspect you're on Apple silicon with those times (but maybe much faster storage!). It's interesting you don't see a difference between the two methods, but on my AMD machine it's now quicker with the arrow var.



Re: [I] R: altrep data type slows down evaluation [arrow]

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.

paleolimbot commented on issue #39004:
URL: https://github.com/apache/arrow/issues/39004#issuecomment-1834306130

   Thanks for opening the issue, and thanks for the reprex!
   
   It is true that ALTREP objects generally perform more slowly than non-ALTREP objects, although I wouldn't have expected this particular operation to be that much slower.
   
   I will dig into this, but in the meantime, you can turn ALTREP off using `options(arrow.use_altrep = FALSE)`:
   
   ``` r
   library(arrow)
   library(dplyr)
   options(arrow.use_altrep = FALSE)
   
   # generate data
   x = runif(29500000) * 10
   d = data.frame(cv = x)
   write_dataset(d, "/tmp/data.arrow")
   # then read back
   df = open_dataset("/tmp/data.arrow/") %>% select(cv) %>% collect()
   x = df$cv
   y = x + 0
   
   identical(x, y)
   #> [1] TRUE
   microbenchmark::microbenchmark(x={sum(is.na(x))}, y={sum(is.na(y))})
   #> Warning in microbenchmark::microbenchmark(x = {: less accurate nanosecond times
   #> to avoid potential integer overflows
   #> Unit: milliseconds
   #>  expr      min       lq     mean   median       uq       max neval
   #>     x 41.99819 42.61737 46.73451 46.07767 46.77581 106.51480   100
   #>     y 41.97875 42.59027 46.16944 46.06932 46.63008  67.24804   100
   ```
   
   <sup>Created on 2023-11-30 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
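   
   If switching ALTREP off for the whole session is too broad, a narrower option may be to copy only the affected column into an ordinary R vector before doing heavy element-wise work. The following is a minimal sketch that reuses the `x + 0` trick from the reprex above. The names `x_altrep` and `x_plain`, the smaller vector length, and the use of `as_chunked_array()` instead of a dataset round trip are illustrative assumptions, and whether the extra copy pays for itself has not been benchmarked here:
   
   ```r
   library(arrow, warn.conflicts = FALSE)
   
   # Smaller than the vector in the report, only to keep the sketch quick
   x <- runif(1e6) * 10
   
   # Round-trip through Arrow; with the default options the result is
   # expected to be ALTREP-backed (assumption: arrow.use_altrep is not FALSE)
   x_altrep <- as.vector(as_chunked_array(x))
   
   # Arithmetic allocates a new, ordinary double vector holding the same values
   x_plain <- x_altrep + 0
   
   # The copy should hold exactly the same values as the original input
   identical(x_plain, x)
   
   # Run the NA check on the plain copy rather than on the Arrow-backed vector
   sum(is.na(x_plain))
   ```
   
   The copy costs one extra allocation the size of the column, so this is probably only worthwhile when the same vector is scanned repeatedly.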
   



Re: [I] [R] altrep data type slows down evaluation [arrow]

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.

paleolimbot commented on issue #39004:
URL: https://github.com/apache/arrow/issues/39004#issuecomment-1875580374

   Slightly more minimal reprex:
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
   
   x <- runif(29500000) * 10
   x_altrep <- as.vector(as_chunked_array(x))
   
   bench::mark(
     is.na(x),
     is.na(x_altrep)
   )
   #> # A tibble: 2 × 6
   #>   expression           min   median `itr/sec` mem_alloc `gc/sec`
   #>   <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
   #> 1 is.na(x)          23.7ms   24.1ms     41.3      113MB     15.9
   #> 2 is.na(x_altrep)  244.4ms  244.5ms      4.09     113MB      0
   ```
   
   <sup>Created on 2024-01-03 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>

