Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2022/12/06 09:39:00 UTC
[jira] [Commented] (ARROW-18372) [R] "Error in `collect()`: ! Invalid: negative malloc size" after large computation returning one cell
[ https://issues.apache.org/jira/browse/ARROW-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643753#comment-17643753 ]
Nicola Crane commented on ARROW-18372:
--------------------------------------
Thanks for reporting this, [~lucasmation]. This might be a tricky one to pin down. Do you know whether this was also an issue in previous versions of Arrow, or only in 10.0.0? Are you able to install 9.0.0 and test it there, so we can check whether it's a regression or an existing bug?
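In case it helps with the comparison, here is a minimal sketch of a check that could be run under each Arrow version. It is only an illustration: the toy table and its values are hypothetical stand-ins for the real dataset, and the commented-out install line assumes the `remotes` package is available (on Windows an older release may need a source build). A table this small is not expected to trigger the error; it only shows the one-row, one-column result the full query should return.

```r
# Assumption: to test an older release, install it first in a fresh session:
# remotes::install_version("arrow", version = "9.0.0")

library(arrow)
library(dplyr)

# Hypothetical toy data standing in for the real 900-million-row dataset
toy <- data.frame(
  id1 = c(1L, 1L, 2L, 2L),
  id2 = c("a", "a", "b", "c"),
  id3 = c(10L, 10L, 20L, 20L),
  id4 = c(TRUE, TRUE, FALSE, FALSE)
)

# Same pipeline shape as in the report: count the combinations,
# then count the resulting rows to get the number of unique combinations
d <- arrow_table(toy) %>%
  count(id1, id2, id3, id4) %>%
  count() %>%
  collect()

# Record which Arrow version produced the result
print(packageVersion("arrow"))
print(d)
```

Replacing `arrow_table(toy)` with the `open_dataset()` call from the report would run the same check against the real data under each version.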
> [R] "Error in `collect()`: ! Invalid: negative malloc size" after large computation returning one cell
> ------------------------------------------------------------------------------------------------------
>
> Key: ARROW-18372
> URL: https://issues.apache.org/jira/browse/ARROW-18372
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 10.0.0
> Reporter: Lucas Mation
> Priority: Major
>
> I have a large parquet dataset (900 million rows, 40 columns), subdivided into folders for each year. I was trying to calculate how many unique combinations of id1+id2+id3+id4 there are in the dataset.
>
> Notice that the "collected" dataset is supposed to be only one row and one cell, containing the count (I've confirmed this by subsetting the dataset ("%>% head(10^6)") before computing the count, and it works). That is why the error below is so weird.
> ```
> fa <- 'my parquet folder' # huge
> va <- open_dataset(fa)
> tic()
> d <- va %>% head(10^6) %>% count(id1,id2,id3,id4) %>% count %>% collect
> toc()
>
> Error in `collect()`:
> ! Invalid: negative malloc size
> Run `rlang::last_error()` to see where the error occurred.
>
> > rlang::last_error()
> <error/rlang_error>
> Error in `collect()`:
> ! Invalid: negative malloc size
> ---
> Backtrace:
> 1. ... %>% collect
> 3. arrow:::collect.arrow_dplyr_query(.)
> Run `rlang::last_trace()` to see the full context.
>
> > rlang::last_trace()
> <error/rlang_error>
> Error in `collect()`:
> ! Invalid: negative malloc size
> ---
> Backtrace:
> x
> 1. +-... %>% collect
> 2. +-dplyr::collect(.)
> 3. \-arrow:::collect.arrow_dplyr_query(.)
> 4. \-base::tryCatch(...)
> 5. \-base (local) tryCatchList(expr, classes, parentenv, handlers)
> 6. \-base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
> 7. \-value[[3L]](cond)
> 8. \-arrow:::augment_io_error_msg(e, call, schema = x$.data$schema)
> 9. \-rlang::abort(msg, call = call)
>
> ```
> I am running this on a Windows server with 512 GB of RAM.
> sessionInfo()
> R version 4.2.1 (2022-06-23 ucrt)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows Server 2012 R2 x64 (build 9600)
> Matrix products: default
> locale:
> [1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252 LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
> [5] LC_TIME=Portuguese_Brazil.1252
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
> other attached packages:
> [1] arrow_10.0.0 data.table_1.14.4 forcats_0.5.2 dplyr_1.0.10 purrr_0.3.5 readr_2.1.3 tidyr_1.2.1 tibble_3.1.8
> [9] ggplot2_3.3.6 tidyverse_1.3.2 gt_0.7.0 xtable_1.8-4 ggthemes_4.2.4 collapse_1.8.6 pryr_0.1.5 janitor_2.1.0
> [17] tictoc_1.1 lubridate_1.8.0 stringr_1.4.1 readxl_1.4.1
> loaded via a namespace (and not attached):
> [1] Rcpp_1.0.9 assertthat_0.2.1 digest_0.6.30 utf8_1.2.2 R6_2.5.1 cellranger_1.1.0 backports_1.4.1
> [8] reprex_2.0.2 httr_1.4.4 pillar_1.8.1 rlang_1.0.6 googlesheets4_1.0.1 rstudioapi_0.14 googledrive_2.0.0
> [15] bit_4.0.4 munsell_0.5.0 broom_1.0.1 compiler_4.2.1 modelr_0.1.9 pkgconfig_2.0.3 htmltools_0.5.3
> [22] tidyselect_1.2.0 codetools_0.2-18 fansi_1.0.3 crayon_1.5.2 tzdb_0.3.0 dbplyr_2.2.1 withr_2.5.0
> [29] grid_4.2.1 jsonlite_1.8.3 gtable_0.3.1 lifecycle_1.0.3 DBI_1.1.3 magrittr_2.0.3 scales_1.2.1
> [36] cli_3.4.1 stringi_1.7.8 fs_1.5.2 snakecase_0.11.0 xml2_1.3.3 ellipsis_0.3.2 generics_0.1.3
> [43] vctrs_0.5.0 tools_4.2.1 bit64_4.0.5 glue_1.6.2 hms_1.1.2 parallel_4.2.1 fastmap_1.1.0
> [50] colorspace_2.0-3 gargle_1.2.1 rvest_1.0.3 haven_2.5.1
>
> arrow_info()
> Arrow package version: 10.0.0
> Capabilities:
>
> dataset TRUE
> substrait FALSE
> parquet TRUE
> json TRUE
> s3 TRUE
> gcs TRUE
> utf8proc TRUE
> re2 TRUE
> snappy TRUE
> gzip TRUE
> brotli TRUE
> zstd TRUE
> lz4 TRUE
> lz4_frame TRUE
> lzo FALSE
> bz2 TRUE
> jemalloc FALSE
> mimalloc TRUE
> Arrow options():
>
> arrow.use_threads FALSE
> Memory:
>
> Allocator mimalloc
> Current 74.82 Gb
> Max 97.75 Gb
> Runtime:
>
> SIMD Level avx2
> Detected SIMD Level avx2
> Build:
>
> C++ Library Version 10.0.0
> C++ Compiler GNU
> C++ Compiler Version 10.3.0
> Git ID aa7118b6e5f49b354fa8a93d9cf363c9ebe9a3f0
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)