Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/16 18:57:29 UTC

[GitHub] [arrow] westonpace commented on pull request #10118: ARROW-12468: Expose ScannerBuilder::UseAsync to python & R

westonpace commented on pull request #10118:
URL: https://github.com/apache/arrow/pull/10118#issuecomment-841861097


   This PR could use some advice from the R community.  I'm adding the ability to request async scanning.  At the moment async is a performance degradation in some cases when I/O is really fast, so until we've made more progress there it will need to be optional.  I've added `UseAsync` to the scanner in R, which is used, for example, like this:
   
   ```
   test_that("Scanner$ScanBatches", {
     ds <- open_dataset(ipc_dir, format = "feather")
     batches <- ds$NewScan()$Finish()$ScanBatches()
     table <- Table$create(!!!batches)
     expect_equivalent(as.data.frame(table), rbind(df1, df2))
   
     batches <- ds$NewScan()$UseAsync(TRUE)$Finish()$ScanBatches()
     table <- Table$create(!!!batches)
     expect_equivalent(as.data.frame(table), rbind(df1, df2))
   })
   ```
   However, most of the examples I see reading a dataset are doing something like...
   
   ```
   ds %>%
         select(string = chr, integer = int) %>%
         filter(integer > 6 & integer < 11) %>%
         collect() %>%
         summarize(mean = mean(integer))
   ```
   
   How should `UseAsync` be inserted into such a chain of calls?  Should it be its own operator:
   
   ```
   ds %>%
         select(string = chr, integer = int) %>%
         filter(integer > 6 & integer < 11) %>%
         use_async() %>%
         collect() %>%
         summarize(mean = mean(integer))
   ```
   
   ...or an argument to `collect`:
   
   ```
   ds %>%
         select(string = chr, integer = int) %>%
         filter(integer > 6 & integer < 11) %>%
         collect(use_async=TRUE) %>%
         summarize(mean = mean(integer))
   ```
   ...or exposed some other way?
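
   To make the two shapes concrete, here is a minimal base-R sketch that does not touch arrow at all. `new_query`, `collect_toy`, and the `use_async` helper are hypothetical stand-ins invented for illustration, not existing arrow functions; the sketch only shows where the flag would live in each design.

   ```r
   # Toy query object (hypothetical): carries the data plus a use_async flag.
   new_query <- function(data) {
     structure(list(data = data, use_async = FALSE), class = "toy_query")
   }

   # Option 1: a dedicated verb that records the flag on the query.
   use_async <- function(query, value = TRUE) {
     query$use_async <- value
     query
   }

   # Option 2: an argument on the collecting step, defaulting to the
   # flag already stored on the query.
   collect_toy <- function(query, use_async = query$use_async) {
     # A real implementation would hand the flag to the scanner here.
     list(rows = query$data, async = use_async)
   }

   q <- new_query(data.frame(int = 7:10))
   a <- collect_toy(use_async(q))          # flag set by its own verb
   b <- collect_toy(q, use_async = TRUE)   # flag passed at collect time
   ```

   With this arrangement both spellings can coexist: the verb stores the flag on the query and the argument overrides it at collect time, so `a` and `b` end up the same.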


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org