Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2021/01/28 16:13:00 UTC

[jira] [Updated] (ARROW-11415) [R] experimental map_batches cannot find columns

     [ https://issues.apache.org/jira/browse/ARROW-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-11415:
------------------------------------
    Summary: [R] experimental map_batches cannot find columns  (was: experimental map_batches cannot find columns)

> [R] experimental map_batches cannot find columns
> ------------------------------------------------
>
>                 Key: ARROW-11415
>                 URL: https://issues.apache.org/jira/browse/ARROW-11415
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 2.0.0
>            Reporter: Gabriel Bassett
>            Priority: Minor
>
> With dataset:
>  
> {code:java}
> Schema
> X3: timestamp[us]
> user_id: dictionary<values=string, indices=int32>
> classification_name: dictionary<values=string, indices=int32>
> X2: string
> X1: dictionary<values=string, indices=int32>
> X4: string
> X5: dictionary<values=string, indices=int32>
> X6: dictionary<values=string, indices=int32>
> {code}
> The following succeeds:
> {code:java}
> chunk <- ds %>%
>     select(user_id) %>%
>     collect() %>%
>     count(user_id) %>%
>     as_tibble() %>%
>     count(user_id, wt=n)
> {code}
> While the following fails:
> {code:java}
> chunk <- ds %>%
>     select(user_id) %>%
>     arrow::map_batches(~count(., user_id)) %>%
>     as_tibble() %>%
>     count(user_id, wt=x)
> {code}
> With error:
> {code:java}
> Error: Can't subset columns that don't exist.
> ✖ Column `.drop` doesn't exist.
> Traceback:
> 1. ds %>% select(user_id) %>% arrow::map_batches(~count(., 
>  .     user_id)) %>% as_tibble() %>% count(user_id, wt = x)
> 2. count(., user_id, wt = x)
> 3. group_by(x, ..., .add = TRUE, .drop = .drop)
> 4. as_tibble(.)
> 5. arrow::map_batches(., ~count(., user_id))
> 6. lapply(scanner$Scan(), function(scan_task) {
>  .     lapply(scan_task$Execute(), function(batch) {
>  .         FUN(batch, ...)
>  .     })
>  . })
> 7. map(.x, .f, ...)
> 8. .f(.x[[i]], ...)
> 9. lapply(scan_task$Execute(), function(batch) {
>  .     FUN(batch, ...)
>  . })
> 10. map(.x, .f, ...)
> 11. .f(.x[[i]], ...)
> 12. FUN(batch, ...)
> 13. count(., user_id)
> 14. tally(out, wt = !!enquo(wt), sort = sort, name = name)
> 15. (function() {
>   .     old.options <- options(dplyr.summarise.inform = FALSE)
>   .     on.exit(options(old.options))
>   .     summarise(x, `:=`(!!name, !!n))
>   . })()
> 16. summarise(x, `:=`(!!name, !!n))
> 17. summarise.arrow_dplyr_query(x, `:=`(!!name, !!n))
> 18. dplyr::select(.data, vars_to_keep)
> 19. select.arrow_dplyr_query(.data, vars_to_keep)
> 20. column_select(arrow_dplyr_query(.data), !!!enquos(...))
> 21. .FUN(names(.data), !!!enquos(...))
> 22. eval_select_impl(NULL, .vars, expr(c(!!!dots)), include = .include, 
>   .     exclude = .exclude, strict = .strict, name_spec = unique_name_spec, 
>   .     uniquely_named = TRUE)
> 23. with_subscript_errors(vars_select_eval(vars, expr, strict, data = x, 
>   .     name_spec = name_spec, uniquely_named = uniquely_named, allow_rename = allow_rename, 
>   .     type = type), type = type)
> 24. tryCatch(instrument_base_errors(expr), vctrs_error_subscript = function(cnd) {
>   .     cnd$subscript_action <- subscript_action(type)
>   .     cnd$subscript_elt <- "column"
>   .     cnd_signal(cnd)
>   . })
> 25. tryCatchList(expr, classes, parentenv, handlers)
> 26. tryCatchOne(expr, names, parentenv, handlers[[1L]])
> 27. value[[3L]](cond)
> 28. cnd_signal(cnd)
> 29. rlang:::signal_abort(x)
> {code}
> The dataset is 8 parquet files with no hive partitioning.
>  
> sessionInfo():
> {code:java}
> R version 4.0.3 (2020-10-10)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 18.04.3 LTS
>
> Matrix products: default
> BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
> LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
>
> other attached packages:
>  [1] forcats_0.5.0     stringr_1.4.0     dplyr_1.0.2       purrr_0.3.4      
>  [5] readr_1.4.0       tidyr_1.1.2       tibble_3.0.4      ggplot2_3.3.2    
>  [9] tidyverse_1.3.0   dtplyr_1.0.1      data.table_1.13.2
>
> loaded via a namespace (and not attached):
>  [1] Rcpp_1.0.5            lubridate_1.7.9.2     aws.ec2metadata_0.2.0
>  [4] ps_1.5.0              arrow_2.0.0           assertthat_0.2.1     
>  [7] digest_0.6.27         utf8_1.1.4            aws.signature_0.6.0  
> [10] mime_0.9              IRdisplay_0.7.0       R6_2.5.0             
> [13] cellranger_1.1.0      repr_1.1.0            backports_1.2.0      
> [16] reprex_0.3.0          evaluate_0.14         httr_1.4.2           
> [19] pillar_1.4.7          rlang_0.4.9           curl_4.3             
> [22] uuid_0.1-4            readxl_1.3.1          rstudioapi_0.13      
> [25] bit_4.0.4             munsell_0.5.0         broom_0.7.2          
> [28] compiler_4.0.3        modelr_0.1.8          pkgconfig_2.0.3      
> [31] base64enc_0.1-3       htmltools_0.5.0       tidyselect_1.1.0     
> [34] fansi_0.4.1           crayon_1.3.4          dbplyr_2.0.0         
> [37] withr_2.3.0           grid_4.0.3            jsonlite_1.7.1       
> [40] gtable_0.3.0          lifecycle_0.2.0       DBI_1.1.0            
> [43] magrittr_2.0.1        scales_1.1.1          cli_2.2.0            
> [46] stringi_1.5.3         fs_1.5.0              xml2_1.3.2           
> [49] ellipsis_0.3.1        generics_0.1.0        vctrs_0.3.5          
> [52] IRkernel_1.1.1        tools_4.0.3           bit64_4.0.5          
> [55] glue_1.4.2            hms_0.5.3             aws.s3_0.3.22        
> [58] colorspace_2.0-0      rvest_0.3.6           pbdZMQ_0.3-3.1
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)