Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2022/12/06 09:39:00 UTC
[jira] [Commented] (ARROW-18372) [R] "Error in `collect()`: ! Invalid: negative malloc size" after large computation returning one cell

    [ https://issues.apache.org/jira/browse/ARROW-18372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643753#comment-17643753 ] 

Nicola Crane commented on ARROW-18372:
--------------------------------------

Thanks for reporting this [~lucasmation]. This might be a bit of a tricky one to pin down. Do you know whether this was an issue in previous versions of Arrow, or only in 10.0.0? Are you able to install 9.0.0 and test it there, so we can check whether it's a regression or an existing bug?
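
A minimal sketch of one way to run that check, assuming the `remotes` package is available. `remotes::install_version()` builds from the archived CRAN source, so on Windows it may need Rtools. The dataset path below is a placeholder, not the reporter's actual path.

```r
# Sketch only: install the older arrow release, then restart R before testing
# so the previously loaded arrow 10.0.0 is not still attached.
remotes::install_version("arrow", version = "9.0.0")

library(arrow)
library(dplyr)

# Confirm which arrow version this session is actually using.
packageVersion("arrow")

# Same query shape as in the report, without the head() subset.
# "path/to/parquet_folder" is a hypothetical placeholder.
ds <- open_dataset("path/to/parquet_folder")
ds %>% count(id1, id2, id3, id4) %>% count() %>% collect()
```

If the same error appears under 9.0.0, that points to an existing bug rather than a regression in 10.0.0.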

> [R] "Error in `collect()`: ! Invalid: negative malloc size" after large computation returning one cell
> ------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-18372
>                 URL: https://issues.apache.org/jira/browse/ARROW-18372
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 10.0.0
>            Reporter: Lucas Mation
>            Priority: Major
>
> I have a large Parquet dataset (900 million rows, 40 columns), subdivided into folders for each year. I was trying to calculate how many unique combinations of id1+id2+id3+id4 there are in the dataset.
>  
> Notice that the "collected" dataset is supposed to be only one row and one cell, containing the count (I've confirmed this by subsetting the dataset ("%>% head(10^6)") before computing the count, and it works). That is why the error below is so weird.
> ```
> fa <- 'myparteq folder' #huge 
> va <- open_dataset(fa)
> tic()
> d <- va  %>% head(10^6) %>% count(id1,id2,id3,id4) %>% count %>% collect
> toc()
>  
> Error in `collect()`:
> ! Invalid: negative malloc size
> Run `rlang::last_error()` to see where the error occurred.
>  
> > rlang::last_error()
> <error/rlang_error>
> Error in `collect()`:
> ! Invalid: negative malloc size
> ---
> Backtrace:
>  1. ... %>% collect
>  3. arrow:::collect.arrow_dplyr_query(.)
> Run `rlang::last_trace()` to see the full context.
>  
> > rlang::last_trace()
> <error/rlang_error>
> Error in `collect()`:
> ! Invalid: negative malloc size
> ---
> Backtrace:
>     x
>  1. +-... %>% collect
>  2. +-dplyr::collect(.)
>  3. \-arrow:::collect.arrow_dplyr_query(.)
>  4.   \-base::tryCatch(...)
>  5.     \-base (local) tryCatchList(expr, classes, parentenv, handlers)
>  6.       \-base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
>  7.         \-value[[3L]](cond)
>  8.           \-arrow:::augment_io_error_msg(e, call, schema = x$.data$schema)
>  9.             \-rlang::abort(msg, call = call)
>  
> ```
> I am running this on a Windows server with 512 GB of RAM.
>  sessionInfo()
> R version 4.2.1 (2022-06-23 ucrt)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows Server 2012 R2 x64 (build 9600)
> Matrix products: default
> locale:
> [1] LC_COLLATE=Portuguese_Brazil.1252  LC_CTYPE=Portuguese_Brazil.1252    LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C                      
> [5] LC_TIME=Portuguese_Brazil.1252    
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base     
> other attached packages:
>  [1] arrow_10.0.0      data.table_1.14.4 forcats_0.5.2     dplyr_1.0.10      purrr_0.3.5  readr_2.1.3       tidyr_1.2.1       tibble_3.1.8     
>  [9] ggplot2_3.3.6     tidyverse_1.3.2   gt_0.7.0          xtable_1.8-4      ggthemes_4.2.4    collapse_1.8.6    pryr_0.1.5        janitor_2.1.0    
> [17] tictoc_1.1        lubridate_1.8.0   stringr_1.4.1     readxl_1.4.1     
> loaded via a namespace (and not attached):
>  [1] Rcpp_1.0.9          assertthat_0.2.1    digest_0.6.30       utf8_1.2.2          R6_2.5.1            cellranger_1.1.0    backports_1.4.1    
>  [8] reprex_2.0.2        httr_1.4.4          pillar_1.8.1        rlang_1.0.6         googlesheets4_1.0.1 rstudioapi_0.14     googledrive_2.0.0  
> [15] bit_4.0.4           munsell_0.5.0       broom_1.0.1         compiler_4.2.1      modelr_0.1.9        pkgconfig_2.0.3     htmltools_0.5.3    
> [22] tidyselect_1.2.0    codetools_0.2-18    fansi_1.0.3         crayon_1.5.2        tzdb_0.3.0          dbplyr_2.2.1        withr_2.5.0        
> [29] grid_4.2.1          jsonlite_1.8.3      gtable_0.3.1        lifecycle_1.0.3     DBI_1.1.3           magrittr_2.0.3      scales_1.2.1       
> [36] cli_3.4.1           stringi_1.7.8       fs_1.5.2            snakecase_0.11.0    xml2_1.3.3          ellipsis_0.3.2      generics_0.1.3     
> [43] vctrs_0.5.0         tools_4.2.1         bit64_4.0.5         glue_1.6.2          hms_1.1.2           parallel_4.2.1      fastmap_1.1.0      
> [50] colorspace_2.0-3    gargle_1.2.1        rvest_1.0.3         haven_2.5.1    
>  
>  arrow_info()
> Arrow package version: 10.0.0
> Capabilities:
>                
> dataset    TRUE
> substrait FALSE
> parquet    TRUE
> json       TRUE
> s3         TRUE
> gcs        TRUE
> utf8proc   TRUE
> re2        TRUE
> snappy     TRUE
> gzip       TRUE
> brotli     TRUE
> zstd       TRUE
> lz4        TRUE
> lz4_frame  TRUE
> lzo       FALSE
> bz2        TRUE
> jemalloc  FALSE
> mimalloc   TRUE
> Arrow options():
>                        
> arrow.use_threads FALSE
> Memory:
>                   
> Allocator mimalloc
> Current   74.82 Gb
> Max       97.75 Gb
> Runtime:
>                         
> SIMD Level          avx2
> Detected SIMD Level avx2
> Build:
>                                                              
> C++ Library Version                                    10.0.0
> C++ Compiler                                              GNU
> C++ Compiler Version                                   10.3.0
> Git ID               aa7118b6e5f49b354fa8a93d9cf363c9ebe9a3f0
>  


