Posted to issues@arrow.apache.org by "Carl Boettiger (Jira)" <ji...@apache.org> on 2022/08/26 22:50:00 UTC

[jira] [Created] (ARROW-17541) [R] Substantial RAM use increase in 9.0.0 release on write_dataset()

Carl Boettiger created ARROW-17541:
--------------------------------------

             Summary: [R] Substantial RAM use increase in 9.0.0 release on write_dataset()
                 Key: ARROW-17541
                 URL: https://issues.apache.org/jira/browse/ARROW-17541
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 9.0.0
            Reporter: Carl Boettiger


Consider the following example of opening a remote dataset (a single 4 GB parquet file) and streaming it to disk, as in this reprex:

 
# open a single ~4 GB parquet file over S3 and stream it back out to disk
s3 <- arrow::s3_bucket("data", endpoint_override = "minio3.ecoforecast.org", anonymous = TRUE)
df <- arrow::open_dataset(s3$path("waq_test"))
arrow::write_dataset(df, tempfile())
 

In arrow 8.0.0, this operation peaks at roughly 10 GB of RAM use, which is already surprisingly high given that the whole file is only 4 GB on disk. In arrow 9.0.0, RAM use for the same operation approximately doubles, which is enough to trigger the OOM killer in several of our active production workflows.
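
For context, this is roughly how I have been estimating the peak. It is a minimal sketch that assumes default_memory_pool()$max_memory is available (as in recent arrow releases); it reports the high-water mark of Arrow's own allocator, not total process RSS, so the peak the OOM killer sees may be even larger:

# track the Arrow allocator's high-water mark around the reprex above
pool <- arrow::default_memory_pool()

s3 <- arrow::s3_bucket("data", endpoint_override = "minio3.ecoforecast.org", anonymous = TRUE)
df <- arrow::open_dataset(s3$path("waq_test"))
arrow::write_dataset(df, tempfile())

pool$max_memory    # peak bytes allocated by Arrow during the run
pool$backend_name  # e.g. "jemalloc", "mimalloc", or "system"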

 

Can the large RAM use increase introduced in 9.0.0 be avoided?  Could this operation use even less RAM than it does in the 8.0.0 release?  Is there something about this particular parquet file that could be responsible for the large RAM use?
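
For what it's worth, here is a sketch of the knobs I could imagine trying. These are assumptions on my part rather than a known fix: set_cpu_count() / set_io_thread_count() to limit how much data is in flight at once, and the max_rows_per_group argument to write_dataset() in case smaller row groups buffer less before each write:

arrow::set_cpu_count(2)        # limit scan/compute parallelism
arrow::set_io_thread_count(2)  # limit concurrent S3 reads

s3 <- arrow::s3_bucket("data", endpoint_override = "minio3.ecoforecast.org", anonymous = TRUE)
df <- arrow::open_dataset(s3$path("waq_test"))

# assumption: capping row-group size reduces how much is buffered per write
arrow::write_dataset(df, tempfile(), max_rows_per_group = 100000)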

 

Arrow's impressively fast performance on large data on remote hosts is really game-changing for us.  Still, the OOM errors are a bit unexpected at this scale (i.e. a single 4 GB parquet file), and as R users we really depend on arrow's out-of-core operations to work with larger-than-RAM data.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)