Posted to issues@arrow.apache.org by "r2evans (via GitHub)" <gi...@apache.org> on 2023/03/07 13:51:12 UTC

[GitHub] [arrow] r2evans opened a new issue, #34487: memory allocation crash

r2evans opened a new issue, #34487:
URL: https://github.com/apache/arrow/issues/34487

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Motivated by https://stackoverflow.com/questions/75657380/readr-vs-data-table-different-results-on-fedora, I downloaded its sample data (https://www.usitc.gov/data/gravity/itpd_e/itpd_e_r02.zip) and read the CSV with various functions. I was able to read the file successfully (albeit slowly for most) using `utils::read.csv`, `readr::read_csv`, `data.table::fread`, and `arrow::open_dataset(., format="csv")`, but when I tried `arrow::read_csv_arrow`, my R session crashed:
   
   ```r
   packageVersion("arrow")
   # [1] '10.0.1'
   obj3 <- arrow::read_csv_arrow("~/Downloads/ITPD_E_R02.csv")
   # D:/a/rtools-packages/rtools-packages/mingw-w64-arrow/src/apache-arrow-10.0.1/cpp/src/arrow/result.cc:28: ValueOrDie called on an error: Out of memory: malloc of size 262144 failed
   # Process R:3 exited abnormally with code 9 at Tue Mar  7 08:30:27 2023
   ```
   
   (FYI, I do not have a `D:` drive; that path must be compiled into the symbols.)
   
   I tried it again, same computer, new/fresh R process, same file, different error:
   
   ```r
   obj3 <- arrow::read_csv_arrow("~/Downloads/ITPD_E_R02.csv")
   # terminate called after throwing an instance of 'cpp11::unwind_exception'
   #   what():  std::exception
   # Process R:3 exited abnormally with code 9 at Tue Mar  7 08:39:53 2023
   ```
   
   I tried upgrading arrow and it still fails:
   
   ```r
   packageVersion("arrow")
   # [1] '11.0.0.2'
   obj3 <- arrow::read_csv_arrow("~/Downloads/ITPD_E_R02.csv")
   # terminate called after throwing an instance of 'cpp11::unwind_exception'
   #   what():  std::exception
   # Process R:3 exited abnormally with code 9 at Tue Mar  7 08:43:51 2023
   ```
   
   The CSV file itself is 6.8GB and, once read into R, typically consumes 7GB+ of RAM. 
   My system is Win11 22H2 (OS Build 22621.1265) with 64GB of RAM, running R inside emacs/ess.
   
   For perspective, the data does not appear to contain anything cosmic:
   
   ```r
   obj3 <- arrow::open_dataset("~/Downloads/ITPD_E_R02.csv", format="csv")
   dat <- head(obj3) %>%
     collect()
   dat
   # # A tibble: 6 × 13
   #   export…¹ expor…² expor…³ impor…⁴ impor…⁵ impor…⁶ broad…⁷ indus…⁸ indus…⁹  year
   #   <chr>    <chr>   <chr>   <chr>   <chr>   <chr>   <chr>     <int> <chr>   <int>
   # 1 SVU      SVU     Soviet… AFG     AFG     Afghan… Agricu…       1 Wheat    1986
   # 2 SVU      SVU     Soviet… AFG     AFG     Afghan… Agricu…       1 Wheat    1987
   # 3 AUS      AUS     Austra… AFG     AFG     Afghan… Agricu…       1 Wheat    1989
   # 4 FIN      FIN     Finland AFG     AFG     Afghan… Agricu…       1 Wheat    1989
   # 5 IND      IND     India   AFG     AFG     Afghan… Agricu…       1 Wheat    1990
   # 6 BLX      BLX     Belgiu… AFG     AFG     Afghan… Agricu…       1 Wheat    1990
   # # … with 3 more variables: trade <dbl>, flag_mirror <int>, flag_zero <chr>, and
   # #   abbreviated variable names ¹​exporter_iso3, ²​exporter_dynamic_code,
   # #   ³​exporter_name, ⁴​importer_iso3, ⁵​importer_dynamic_code, ⁶​importer_name,
   # #   ⁷​broad_sector, ⁸​industry_id, ⁹​industry_descr
   # # ℹ Use `colnames()` to see all variable names
   dput(dat)
   # structure(list(exporter_iso3 = c("SVU", "SVU", "AUS", "FIN", 
   # "IND", "BLX"), exporter_dynamic_code = c("SVU", "SVU", "AUS", 
   # "FIN", "IND", "BLX"), exporter_name = c("Soviet Union", "Soviet Union", 
   # "Australia", "Finland", "India", "Belgium-Luxembourg"), importer_iso3 = c("AFG", 
   # "AFG", "AFG", "AFG", "AFG", "AFG"), importer_dynamic_code = c("AFG", 
   # "AFG", "AFG", "AFG", "AFG", "AFG"), importer_name = c("Afghanistan", 
   # "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan"
   # ), broad_sector = c("Agriculture", "Agriculture", "Agriculture", 
   # "Agriculture", "Agriculture", "Agriculture"), industry_id = c(1L, 
   # 1L, 1L, 1L, 1L, 1L), industry_descr = c("Wheat", "Wheat", "Wheat", 
   # "Wheat", "Wheat", "Wheat"), year = c(1986L, 1987L, 1989L, 1989L, 
   # 1990L, 1990L), trade = c(14.761, 1.98, 0.191, 0.175, 0.553, 0.36
   # ), flag_mirror = c(1L, 1L, 1L, 1L, 1L, 1L), flag_zero = c("p", 
   # "p", "p", "p", "p", "p")), class = c("tbl_df", "tbl", "data.frame"
   # ), row.names = c(NA, -6L))
   ```
   
   (I recognize that data of this size should at least be opened lazily using `open_dataset` or converted to a better storage format; that is not the point of this issue.)
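
   For readers who do want the lazy route mentioned above, here is a minimal sketch of opening a CSV lazily and converting it to Parquet. It is an illustration only, not a fix for the crash. The temporary file is a tiny stand-in for `ITPD_E_R02.csv` (a few rows and columns copied from the sample shown earlier), and the names `csv_path`, `parquet_dir`, and `wheat_1990` are hypothetical.

   ```r
   library(arrow)
   library(dplyr)

   # Tiny stand-in for ITPD_E_R02.csv so the sketch runs anywhere (assumption:
   # the real file has these columns among its 13).
   csv_path <- tempfile(fileext = ".csv")
   write.csv(
     data.frame(
       exporter_iso3  = c("SVU", "AUS", "IND"),
       industry_descr = c("Wheat", "Wheat", "Wheat"),
       year           = c(1986L, 1989L, 1990L),
       trade          = c(14.761, 0.191, 0.553)
     ),
     csv_path,
     row.names = FALSE
   )

   # Open lazily: the rows are not materialized as an R data frame here.
   ds <- open_dataset(csv_path, format = "csv")

   # Convert to Parquet on disk without collecting the data into R.
   parquet_dir <- tempfile("itpd_parquet_")
   write_dataset(ds, parquet_dir, format = "parquet")

   # Query the Parquet copy and collect only the rows that are needed.
   pq <- open_dataset(parquet_dir)
   wheat_1990 <- pq %>%
     filter(year == 1990, industry_descr == "Wheat") %>%
     collect()
   ```

   With the real file, `csv_path` would point at `~/Downloads/ITPD_E_R02.csv` and `parquet_dir` at any writable directory.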
   
   ---
   
   Session info:
   
   ```r
   sessioninfo::session_info()
   # ─ Session info ───────────────────────────────────────────────────────────────
   #  setting  value
   #  version  R version 4.2.2 (2022-10-31 ucrt)
   #  os       Windows 10 x64 (build 22621)
   #  system   x86_64, mingw32
   #  ui       RTerm
   #  language (EN)
   #  collate  English_United States.utf8
   #  ctype    English_United States.utf8
   #  tz       America/New_York
   #  date     2023-03-07
   #  pandoc   2.17.1.1 @ C:/Users/r2/AppData/Local/Pandoc/ (via rmarkdown)
   # ─ Packages ───────────────────────────────────────────────────────────────────
   #  package     * version date (UTC) lib source
   #  cli           3.4.1   2022-09-23 [1] RSPM (R 4.2.0)
   #  digest        0.6.31  2022-12-11 [1] RSPM (R 4.2.0)
   #  evaluate      0.19    2022-12-13 [2] CRAN (R 4.2.2)
   #  fastmap       1.1.0   2021-01-25 [2] CRAN (R 4.2.2)
   #  htmltools     0.5.4   2022-12-07 [1] RSPM (R 4.2.0)
   #  knitr         1.41    2022-11-18 [1] RSPM (R 4.2.0)
   #  r2          * 0.9.15  2022-12-14 [1] local
   #  rlang         1.0.6   2022-09-24 [1] RSPM (R 4.2.0)
   #  rmarkdown     2.18    2022-11-09 [1] RSPM (R 4.2.0)
   #  sessioninfo   1.2.2   2021-12-06 [1] RSPM (R 4.2.0)
   #  xfun          0.35    2022-11-16 [1] RSPM (R 4.2.0)
   #  [1] C:/Users/r2/AppData/Local/R/win-library/4.2
   #  [2] C:/R/R-4.2.2/library
   # ──────────────────────────────────────────────────────────────────────────────
   ```
   
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] eitsupi commented on issue #34487: [R] memory allocation crash

Posted by "eitsupi (via GitHub)" <gi...@apache.org>.
eitsupi commented on issue #34487:
URL: https://github.com/apache/arrow/issues/34487#issuecomment-1459833409

   I tried it with R on Ubuntu 22.04, with arrow installed from the RSPM binary, and was able to read the CSV successfully (10GB of RAM used).
   Is it possible that this is a bug related to how arrow is installed, or to the OS?
   
   ```r
   R version 4.2.2 (2022-10-31) -- "Innocent and Trusting"
   Platform: x86_64-pc-linux-gnu (64-bit)
   
   > obj3 <- arrow::read_csv_arrow("ITPD_E_R02.csv", as_data_frame = FALSE)
   
   > obj3
   Table
   72534869 rows x 13 columns
   $exporter_iso3 <string>
   $exporter_dynamic_code <string>
   $exporter_name <string>
   $importer_iso3 <string>
   $importer_dynamic_code <string>
   $importer_name <string>
   $broad_sector <string>
   $industry_id <int64>
   $industry_descr <string>
   $year <int64>
   $trade <double>
   $flag_mirror <int64>
   $flag_zero <string>
   ```
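
   To compare installations across machines, one possible starting point is to print the build information and memory pool details that the arrow R package reports. This is only a sketch of what could be collected on each system; it assumes `arrow_info()` and `default_memory_pool()` are available in the installed version, and it does not by itself show where the failed allocation comes from.

   ```r
   library(arrow)

   # Version, compiled-in capabilities, and allocator of this arrow build.
   info <- arrow_info()
   print(info)

   # Details of the default memory pool used for Arrow allocations.
   pool <- default_memory_pool()
   pool$backend_name     # name of the allocator backend in use
   pool$bytes_allocated  # bytes currently allocated through this pool
   pool$max_memory       # peak allocation seen by this pool so far
   ```

   Running this on both the Windows and the Ubuntu setups would show whether the two builds differ in allocator or capabilities.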

