Posted to issues@arrow.apache.org by "r2evans (via GitHub)" <gi...@apache.org> on 2023/03/07 13:51:12 UTC
[GitHub] [arrow] r2evans opened a new issue, #34487: memory allocation crash
r2evans opened a new issue, #34487:
URL: https://github.com/apache/arrow/issues/34487
### Describe the bug, including details regarding any error messages, version, and platform.
Motivated by https://stackoverflow.com/questions/75657380/readr-vs-data-table-different-results-on-fedora, I downloaded its sample data (https://www.usitc.gov/data/gravity/itpd_e/itpd_e_r02.zip) and read the CSV with various functions. I was able to read the file successfully (albeit slowly for most) using `utils::read.csv`, `readr::read_csv`, `data.table::fread`, and `arrow::open_dataset(., format="csv")`, but when I tried `arrow::read_csv_arrow`, my R process crashed:
```r
packageVersion("arrow")
# [1] '10.0.1'
obj3 <- arrow::read_csv_arrow("~/Downloads/ITPD_E_R02.csv")
# D:/a/rtools-packages/rtools-packages/mingw-w64-arrow/src/apache-arrow-10.0.1/cpp/src/arrow/result.cc:28: ValueOrDie called on an error: Out of memory: malloc of size 262144 failed
# Process R:3 exited abnormally with code 9 at Tue Mar 7 08:30:27 2023
```
(FYI, I do not have a `D:` drive; that path must be compiled into the symbols.)
I tried again on the same computer with a fresh R process and the same file, and got a different error:
```r
obj3 <- arrow::read_csv_arrow("~/Downloads/ITPD_E_R02.csv")
# terminate called after throwing an instance of 'cpp11::unwind_exception'
# what(): std::exception
# Process R:3 exited abnormally with code 9 at Tue Mar 7 08:39:53 2023
```
I tried upgrading arrow and it still fails:
```r
packageVersion("arrow")
# [1] '11.0.0.2'
obj3 <- arrow::read_csv_arrow("~/Downloads/ITPD_E_R02.csv")
# terminate called after throwing an instance of 'cpp11::unwind_exception'
# what(): std::exception
# Process R:3 exited abnormally with code 9 at Tue Mar 7 08:43:51 2023
```
The CSV file itself is 6.8GB and, once read into R, typically consumes 7GB+ of RAM.
My system is Win11 22H2 (OS Build 22621.1265) with 64GB of RAM, running R inside emacs/ess.
For perspective, the data does not appear to contain anything cosmic:
```r
library(dplyr)  # provides %>% and collect()
obj3 <- arrow::open_dataset("~/Downloads/ITPD_E_R02.csv", format="csv")
dat <- head(obj3) %>%
  collect()
dat
# # A tibble: 6 × 13
# export…¹ expor…² expor…³ impor…⁴ impor…⁵ impor…⁶ broad…⁷ indus…⁸ indus…⁹ year
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int> <chr> <int>
# 1 SVU SVU Soviet… AFG AFG Afghan… Agricu… 1 Wheat 1986
# 2 SVU SVU Soviet… AFG AFG Afghan… Agricu… 1 Wheat 1987
# 3 AUS AUS Austra… AFG AFG Afghan… Agricu… 1 Wheat 1989
# 4 FIN FIN Finland AFG AFG Afghan… Agricu… 1 Wheat 1989
# 5 IND IND India AFG AFG Afghan… Agricu… 1 Wheat 1990
# 6 BLX BLX Belgiu… AFG AFG Afghan… Agricu… 1 Wheat 1990
# # … with 3 more variables: trade <dbl>, flag_mirror <int>, flag_zero <chr>, and
# # abbreviated variable names ¹exporter_iso3, ²exporter_dynamic_code,
# # ³exporter_name, ⁴importer_iso3, ⁵importer_dynamic_code, ⁶importer_name,
# # ⁷broad_sector, ⁸industry_id, ⁹industry_descr
# # ℹ Use `colnames()` to see all variable names
dput(dat)
# structure(list(exporter_iso3 = c("SVU", "SVU", "AUS", "FIN",
# "IND", "BLX"), exporter_dynamic_code = c("SVU", "SVU", "AUS",
# "FIN", "IND", "BLX"), exporter_name = c("Soviet Union", "Soviet Union",
# "Australia", "Finland", "India", "Belgium-Luxembourg"), importer_iso3 = c("AFG",
# "AFG", "AFG", "AFG", "AFG", "AFG"), importer_dynamic_code = c("AFG",
# "AFG", "AFG", "AFG", "AFG", "AFG"), importer_name = c("Afghanistan",
# "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan"
# ), broad_sector = c("Agriculture", "Agriculture", "Agriculture",
# "Agriculture", "Agriculture", "Agriculture"), industry_id = c(1L,
# 1L, 1L, 1L, 1L, 1L), industry_descr = c("Wheat", "Wheat", "Wheat",
# "Wheat", "Wheat", "Wheat"), year = c(1986L, 1987L, 1989L, 1989L,
# 1990L, 1990L), trade = c(14.761, 1.98, 0.191, 0.175, 0.553, 0.36
# ), flag_mirror = c(1L, 1L, 1L, 1L, 1L, 1L), flag_zero = c("p",
# "p", "p", "p", "p", "p")), class = c("tbl_df", "tbl", "data.frame"
# ), row.names = c(NA, -6L))
```
(I recognize that data of this size should, at the least, be opened lazily using `open_dataset` or converted to a better storage format; that's not the point of this issue.)
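For anyone landing here for the workaround rather than the bug, the lazy route mentioned above can be sketched roughly as follows. This is only an illustration under assumptions: it uses a tiny stand-in CSV written to a temporary directory instead of the real 6.8GB file, the file and directory names are made up, Parquet is just one candidate for a "better storage format", and `dplyr` is assumed to be installed.

```r
library(arrow)

# Hypothetical stand-in for the large CSV: a tiny file in a temp directory.
csv_path <- file.path(tempdir(), "sample.csv")
write.csv(
  data.frame(year = c(1986L, 1987L, 1989L), trade = c(14.761, 1.98, 0.191)),
  csv_path,
  row.names = FALSE
)

# Open lazily: this sets up a Dataset without reading the whole file into R.
ds <- open_dataset(csv_path, format = "csv")

# Convert to Parquet on disk (the output directory name is an assumption).
pq_dir <- file.path(tempdir(), "sample_parquet")
write_dataset(ds, pq_dir, format = "parquet")

# Re-open the Parquet copy and pull only a couple of rows into R.
pq <- open_dataset(pq_dir)
first_rows <- dplyr::collect(head(pq, 2))
first_rows
```

With the real file, only `csv_path` and `pq_dir` would change. None of this addresses the crash in `read_csv_arrow` itself.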
---
Session info:
```r
sessioninfo::session_info()
# ─ Session info ───────────────────────────────────────────────────────────────
# setting value
# version R version 4.2.2 (2022-10-31 ucrt)
# os Windows 10 x64 (build 22621)
# system x86_64, mingw32
# ui RTerm
# language (EN)
# collate English_United States.utf8
# ctype English_United States.utf8
# tz America/New_York
# date 2023-03-07
# pandoc 2.17.1.1 @ C:/Users/r2/AppData/Local/Pandoc/ (via rmarkdown)
# ─ Packages ───────────────────────────────────────────────────────────────────
# package * version date (UTC) lib source
# cli 3.4.1 2022-09-23 [1] RSPM (R 4.2.0)
# digest 0.6.31 2022-12-11 [1] RSPM (R 4.2.0)
# evaluate 0.19 2022-12-13 [2] CRAN (R 4.2.2)
# fastmap 1.1.0 2021-01-25 [2] CRAN (R 4.2.2)
# htmltools 0.5.4 2022-12-07 [1] RSPM (R 4.2.0)
# knitr 1.41 2022-11-18 [1] RSPM (R 4.2.0)
# r2 * 0.9.15 2022-12-14 [1] local
# rlang 1.0.6 2022-09-24 [1] RSPM (R 4.2.0)
# rmarkdown 2.18 2022-11-09 [1] RSPM (R 4.2.0)
# sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.2.0)
# xfun 0.35 2022-11-16 [1] RSPM (R 4.2.0)
# [1] C:/Users/r2/AppData/Local/R/win-library/4.2
# [2] C:/R/R-4.2.2/library
# ──────────────────────────────────────────────────────────────────────────────
```
### Component(s)
R
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] eitsupi commented on issue #34487: [R] memory allocation crash
Posted by "eitsupi (via GitHub)" <gi...@apache.org>.
eitsupi commented on issue #34487:
URL: https://github.com/apache/arrow/issues/34487#issuecomment-1459833409
I tried it with R on Ubuntu 22.04, with arrow installed from the RSPM binary, and was able to read the CSV successfully (about 10GB of RAM used).
Is it possible that this is a bug related to how arrow is installed, or to the OS?
```r
R version 4.2.2 (2022-10-31) -- "Innocent and Trusting"
Platform: x86_64-pc-linux-gnu (64-bit)
> obj3 <- arrow::read_csv_arrow("ITPD_E_R02.csv", as_data_frame = FALSE)
> obj3
Table
72534869 rows x 13 columns
$exporter_iso3 <string>
$exporter_dynamic_code <string>
$exporter_name <string>
$importer_iso3 <string>
$importer_dynamic_code <string>
$importer_name <string>
$broad_sector <string>
$industry_id <int64>
$industry_descr <string>
$year <int64>
$trade <double>
$flag_mirror <int64>
$flag_zero <string>
```
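To compare how arrow was built and installed on each machine, something like the following might help. It is only a sketch: it prints whatever `arrow::arrow_info()` reports for the local install (version, capabilities, memory allocator and so on), and I am assuming that output is detailed enough to spot a difference between the Windows and Linux builds.

```r
library(arrow)

# Collect the build and runtime details of the installed arrow package.
info <- arrow_info()

# Print the full report; pasting this from both machines allows a side-by-side check.
print(info)

# The package version on its own, for a quick comparison.
info$version
```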