Posted to jira@arrow.apache.org by "Lucas Mation (Jira)" <ji...@apache.org> on 2022/10/27 13:37:00 UTC

[jira] [Created] (ARROW-18176) [R] arrow::open_dataset %>% select(myvars) %>% collect causes memory leak

Lucas Mation created ARROW-18176:
------------------------------------

             Summary: [R] arrow::open_dataset %>% select(myvars) %>% collect causes memory leak
                 Key: ARROW-18176
                 URL: https://issues.apache.org/jira/browse/ARROW-18176
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Lucas Mation


I first posted this on StackOverflow, [here.|https://stackoverflow.com/questions/74221492/r-arrow-open-dataset-selectmyvars-collect-causing-memory-leak]

 

I am having trouble using arrow in R. First, I saved some {{data.tables}} that were about 50-60 GB in memory ({{d}} in the code chunk below) to a parquet dataset using:
 
{{d %>% write_dataset(f, format='parquet')  # f is the directory name}}

Then I tried to open the dataset, select the relevant variables, and collect:
 
{{tic()}}
{{d2 <- open_dataset(f) %>% select(all_of(myvars)) %>% collect()  # myvars is a vector of variable names}}
{{toc()}}

I did this conversion for three sets of data.tables (unfortunately, the data is confidential, so I can't include it in the example). For one set, I was able to {{open>select>collect}} the desired table in about 60s, obtaining a 10 GB object (after variable selection).

For the other two sets, the command caused a memory leak. {{tic()}}-{{toc()}} returned after 80s, but the object name ({{d2}}) never appeared in RStudio's "Environment" panel, memory usage kept creeping up until it occupied most of the available RAM of the server, and then R crashed. Note that the original dataset, without subsetting columns, was smaller than 60 GB, and the server had 512 GB.
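Since I can't share the real data, here is a minimal self-contained sketch of the same workflow using synthetic data (column names and sizes here are illustrative, not the real ones):

```r
library(arrow)
library(dplyr)

# Synthetic stand-in for the real data.table (far smaller than 50-60 GB)
d <- data.frame(a = rnorm(1e6), b = rnorm(1e6), c = rnorm(1e6))

# f is the directory the parquet dataset is written to
f <- file.path(tempdir(), "parquet_repro")
write_dataset(d, f, format = "parquet")

# myvars stands in for the real vector of variable names
myvars <- c("a", "b")
d2 <- open_dataset(f) %>%
  select(all_of(myvars)) %>%
  collect()
```

At this small scale the pipeline completes and {{d2}} appears as expected; the crash only shows up on the large confidential tables.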

Any ideas on what could be going on here?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)