Posted to jira@arrow.apache.org by "András Svraka (Jira)" <ji...@apache.org> on 2020/09/24 10:12:00 UTC

[jira] [Created] (ARROW-10080) Arrow does not release unused memory

András Svraka created ARROW-10080:
-------------------------------------

             Summary: Arrow does not release unused memory
                 Key: ARROW-10080
                 URL: https://issues.apache.org/jira/browse/ARROW-10080
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 1.0.1
         Environment: Linux, Windows
            Reporter: András Svraka
         Attachments: sessioninfo.txt

I’m having problems when {{collect()}}-ing Arrow data sources into data frames that are close in size to the available memory on the machine. Consider the following workflow. I have a dataset which I want to query so that at some point it needs to be {{collect()}}-ed, but at the same time I’m also reducing the result. During the intermediate step the entire data frame fits into memory, and the following code runs without any problems.
{code:r}
library(arrow)
library(dplyr)

test_ds <- "memory_test"

# Read the whole dataset into a data frame, then immediately reduce it
ds1 <- open_dataset(test_ds) %>%
  collect() %>%
  dim()
{code}
However, running the same code again in the same R session fails: R runs out of memory.
{code:r}
# The same query, run again in the same R session
ds1 <- open_dataset(test_ds) %>%
  collect() %>%
  dim()
{code}
The example might be a bit contrived, but you can easily imagine a workflow where different queries are run on a dataset and the reduced results are stored.

As far as I understand, R is a garbage-collected language, and in this case there aren’t any references left to large objects in memory. And indeed, the second query succeeds when a garbage collection is forced manually.
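
To illustrate, a minimal sketch of the workaround inline (the reproduction scripts linked below control this with a command line argument instead):
{code:r}
library(arrow)
library(dplyr)

test_ds <- "memory_test"

ds1 <- open_dataset(test_ds) %>%
  collect() %>%
  dim()

# Without this explicit collection, the second query is killed on a
# memory-constrained machine.
gc()

# If your arrow build exposes it, default_memory_pool()$bytes_allocated
# shows the allocator's view of outstanding Arrow allocations.

ds2 <- open_dataset(test_ds) %>%
  collect() %>%
  dim()
{code}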

Is this the expected behaviour from Arrow?

I know this is quite hard to reproduce, as the exact dataset size required to trigger this behaviour depends on the particular machine, but I prepared a reproducible example in [this gist|https://gist.github.com/svraka/c63fca51c6cc50020551e2319ff652b7] that should give the same result on Ubuntu 20.04 with 1GB RAM and no swap. See the attachment for {{sessionInfo()}} output. I ran it on a DigitalOcean {{s-1vcpu-1gb}} droplet.

First, let’s create a partitioned Arrow dataset:
{code:bash}
$ Rscript ds_prep.R 1000000 5
{code}
The first command line argument gives the number of rows in each partition, and the second gives the number of partitions. The parameters are set so that the entire dataset should fit into memory.
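
The actual script is in the gist; purely for illustration, a hypothetical sketch of its shape (the column names and data are assumptions):
{code:r}
# Hypothetical sketch of ds_prep.R -- the real script is in the gist above.
library(arrow)

args   <- commandArgs(trailingOnly = TRUE)
n_rows <- as.integer(args[1])  # rows per partition
n_part <- as.integer(args[2])  # number of partitions

# Assumed schema: one partitioning column plus one numeric column
df <- data.frame(
  part = rep(seq_len(n_part), each = n_rows),
  x    = rnorm(n_rows * n_part)
)

write_dataset(df, "memory_test", partitioning = "part")
{code}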

Then running the two queries fails:
{code:bash}
$ Rscript ds_read.R
Running query, 1st try...
ds size, 1st run: 56
Running query, 2nd try...
[1]    11151 killed     Rscript ds_read.R
{code}
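Again, the real {{ds_read.R}} is in the gist; a hypothetical sketch of its shape, matching the output above (an optional command line argument forces a {{gc()}} between the two runs, and the 56 bytes reported is presumably just the {{object.size()}} of the reduced two-element result):
{code:r}
# Hypothetical sketch of ds_read.R -- the real script is in the gist above.
library(arrow)
library(dplyr)

force_gc <- length(commandArgs(trailingOnly = TRUE)) > 0

query <- function() {
  open_dataset("memory_test") %>%
    collect() %>%
    dim()
}

message("Running query, 1st try...")
ds <- query()
message("ds size, 1st run: ", object.size(ds))

if (force_gc) {
  message("running gc() ...")
  print(gc())
}

message("Running query, 2nd try...")
ds <- query()
message("ds size, 2nd run: ", object.size(ds))
{code}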
However, when forcing a {{gc()}} (which I control here with a command line argument), it succeeds:
{code:bash}
$ Rscript ds_read.R 1
Running query, 1st try...
ds size, 1st run: 56
running gc() ...
          used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  703052 37.6    1571691  84.0  1038494  55.5
Vcells 1179578  9.0   36405636 277.8 41188956 314.3
Running query, 2nd try...
ds size, 2nd run: 56
{code}
In general, [one shouldn’t have to use {{gc()}} manually|https://adv-r.hadley.nz/names-values.html#gc]. Interestingly, making R’s garbage collection more aggressive (see {{?Memory}}) doesn’t help either, presumably because Arrow allocates through its own memory pool outside the R heap, so R’s GC trigger never sees the memory pressure:
{code:bash}
$ R_GC_MEM_GROW=0 Rscript ds_read.R
Running query, 1st try...
ds size, 1st run: 56
Running query, 2nd try...
[1]    11422 killed     Rscript ds_read.R
{code}
I didn’t try to reproduce this problem on macOS, as my Mac would probably start swapping furiously, but I managed to reproduce it on a Windows 7 machine with practically no swap. Of course the parameters are different, and the error messages are presumably system specific.
{code:bash}
$ Rscript ds_prep.R 1000000 40
$ Rscript ds_read.R
Running query, 1st try...
ds size, 1st run: 56
Running query, 2nd try...
Error in dataset___Scanner__ToTable(self) :
  IOError: Out of memory: malloc of size 524288 failed
Calls: collect ... shared_ptr -> shared_ptr_is_null -> dataset___Scanner__ToTable
Execution halted
$ Rscript ds_read.R 1
Running query, 1st try...
ds size, 1st run: 56
running gc() ...
          used (Mb) gc trigger   (Mb)  max used (Mb)
Ncells  688789 36.8    1198030   64.0   1198030   64
Vcells 1109451  8.5  271538343 2071.7 321118845 2450
Running query, 2nd try...
ds size, 2nd run: 56
$ R_GC_MEM_GROW=0 Rscript ds_read.R
Running query, 1st try...
ds size, 1st run: 56
Running query, 2nd try...
Error in dataset___Scanner__ToTable(self) :
  IOError: Out of memory: malloc of size 524288 failed
Calls: collect ... shared_ptr -> shared_ptr_is_null -> dataset___Scanner__ToTable
Execution halted
{code}


