You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2021/01/28 16:13:00 UTC
[jira] [Updated] (ARROW-11415) [R] experimental map_batches cannot
find columns
[ https://issues.apache.org/jira/browse/ARROW-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neal Richardson updated ARROW-11415:
------------------------------------
Summary: [R] experimental map_batches cannot find columns (was: experimental map_batches cannot find columns)
> [R] experimental map_batches cannot find columns
> ------------------------------------------------
>
> Key: ARROW-11415
> URL: https://issues.apache.org/jira/browse/ARROW-11415
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 2.0.0
> Reporter: Gabriel Bassett
> Priority: Minor
>
> With dataset:
>
> {code:java}
> Schema
> X3: timestamp[us]
> user_id: dictionary<values=string, indices=int32>
> classification_name: dictionary<values=string, indices=int32>
> X2: string
> X1: dictionary<values=string, indices=int32>
> X4: string
> X5: dictionary<values=string, indices=int32>
> X6: dictionary<values=string, indices=int32>
> {code}
> The following succeeds:
> {code:java}
> chunk <- ds %>%
> select(user_id) %>%
> collect() %>%
> count(user_id) %>%
> as_tibble() %>%
> count(user_id, wt=n)
> {code}
> While the following fails:
> {code:java}
> chunk <- ds %>%
> select(user_id) %>%
> arrow::map_batches(~count(., user_id)) %>%
> as_tibble() %>%
> count(user_id, wt=x)
> {code}
> With error:
> {code:java}
> Error: Can't subset columns that don't exist.
> ✖ Column `.drop` doesn't exist.
> Traceback:
> 1. ds %>% select(user_id) %>% arrow::map_batches(~count(.,
> . user_id)) %>% as_tibble() %>% count(user_id, wt = x)
> 2. count(., user_id, wt = x)
> 3. group_by(x, ..., .add = TRUE, .drop = .drop)
> 4. as_tibble(.)
> 5. arrow::map_batches(., ~count(., user_id))
> 6. lapply(scanner$Scan(), function(scan_task) {
> . lapply(scan_task$Execute(), function(batch) {
> . FUN(batch, ...)
> . })
> . })
> 7. map(.x, .f, ...)
> 8. .f(.x[[i]], ...)
> 9. lapply(scan_task$Execute(), function(batch) {
> . FUN(batch, ...)
> . })
> 10. map(.x, .f, ...)
> 11. .f(.x[[i]], ...)
> 12. FUN(batch, ...)
> 13. count(., user_id)
> 14. tally(out, wt = !!enquo(wt), sort = sort, name = name)
> 15. (function() {
> . old.options <- options(dplyr.summarise.inform = FALSE)
> . on.exit(options(old.options))
> . summarise(x, `:=`(!!name, !!n))
> . })()
> 16. summarise(x, `:=`(!!name, !!n))
> 17. summarise.arrow_dplyr_query(x, `:=`(!!name, !!n))
> 18. dplyr::select(.data, vars_to_keep)
> 19. select.arrow_dplyr_query(.data, vars_to_keep)
> 20. column_select(arrow_dplyr_query(.data), !!!enquos(...))
> 21. .FUN(names(.data), !!!enquos(...))
> 22. eval_select_impl(NULL, .vars, expr(c(!!!dots)), include = .include,
> . exclude = .exclude, strict = .strict, name_spec = unique_name_spec,
> . uniquely_named = TRUE)
> 23. with_subscript_errors(vars_select_eval(vars, expr, strict, data = x,
> . name_spec = name_spec, uniquely_named = uniquely_named, allow_rename = allow_rename,
> . type = type), type = type)
> 24. tryCatch(instrument_base_errors(expr), vctrs_error_subscript = function(cnd) {
> . cnd$subscript_action <- subscript_action(type)
> . cnd$subscript_elt <- "column"
> . cnd_signal(cnd)
> . })
> 25. tryCatchList(expr, classes, parentenv, handlers)
> 26. tryCatchOne(expr, names, parentenv, handlers[[1L]])
> 27. value[[3L]](cond)
> 28. cnd_signal(cnd)
> 29. rlang:::signal_abort(x)
> {code}
> The dataset is 8 parquet files with no hive partitioning.
>
> sessionInfo():
> {code:java}
> R version 4.0.3 (2020-10-10)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 18.04.3 LTSMatrix products: default
> BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
> LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.solocale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages:
> [1] stats graphics grDevices utils datasets methods base other attached packages:
> [1] forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2 purrr_0.3.4
> [5] readr_1.4.0 tidyr_1.1.2 tibble_3.0.4 ggplot2_3.3.2
> [9] tidyverse_1.3.0 dtplyr_1.0.1 data.table_1.13.2loaded via a namespace (and not attached):
> [1] Rcpp_1.0.5 lubridate_1.7.9.2 aws.ec2metadata_0.2.0
> [4] ps_1.5.0 arrow_2.0.0 assertthat_0.2.1
> [7] digest_0.6.27 utf8_1.1.4 aws.signature_0.6.0
> [10] mime_0.9 IRdisplay_0.7.0 R6_2.5.0
> [13] cellranger_1.1.0 repr_1.1.0 backports_1.2.0
> [16] reprex_0.3.0 evaluate_0.14 httr_1.4.2
> [19] pillar_1.4.7 rlang_0.4.9 curl_4.3
> [22] uuid_0.1-4 readxl_1.3.1 rstudioapi_0.13
> [25] bit_4.0.4 munsell_0.5.0 broom_0.7.2
> [28] compiler_4.0.3 modelr_0.1.8 pkgconfig_2.0.3
> [31] base64enc_0.1-3 htmltools_0.5.0 tidyselect_1.1.0
> [34] fansi_0.4.1 crayon_1.3.4 dbplyr_2.0.0
> [37] withr_2.3.0 grid_4.0.3 jsonlite_1.7.1
> [40] gtable_0.3.0 lifecycle_0.2.0 DBI_1.1.0
> [43] magrittr_2.0.1 scales_1.1.1 cli_2.2.0
> [46] stringi_1.5.3 fs_1.5.0 xml2_1.3.2
> [49] ellipsis_0.3.1 generics_0.1.0 vctrs_0.3.5
> [52] IRkernel_1.1.1 tools_4.0.3 bit64_4.0.5
> [55] glue_1.4.2 hms_0.5.3 aws.s3_0.3.22
> [58] colorspace_2.0-0 rvest_0.3.6 pbdZMQ_0.3-3.1
> {code}
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)