Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2022/10/27 12:38:00 UTC

[jira] [Commented] (ARROW-17541) [R] Substantial RAM use increase in 9.0.0 release on write_dataset()

    [ https://issues.apache.org/jira/browse/ARROW-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625079#comment-17625079 ] 

Dewey Dunnington commented on ARROW-17541:
------------------------------------------

This may or may not be related, but we have a report of "leaked memory" from a dataset collect here:

https://stackoverflow.com/questions/74221492/r-arrow-open-dataset-selectmyvars-collect-causing-memory-leak
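
For reference, the pattern reported there looks roughly like this (a minimal sketch based only on the question title; the dataset path and column names are placeholders, not taken from the report):

{code}
library(arrow)
library(dplyr)

# hypothetical dataset path and variable names, for illustration only
ds <- open_dataset("path/to/parquet_dataset")
result <- ds |>
  select(var1, var2) |>
  collect()
# the report is that memory allocated during collect() is not released afterwards
{code}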

> [R] Substantial RAM use increase in 9.0.0 release on write_dataset()
> --------------------------------------------------------------------
>
>                 Key: ARROW-17541
>                 URL: https://issues.apache.org/jira/browse/ARROW-17541
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 9.0.0
>            Reporter: Carl Boettiger
>            Priority: Critical
>         Attachments: Screenshot 2022-08-30 at 14-23-20 Online Graph Maker · Plotly Chart Studio.png
>
>
> Consider the following reprex, which opens a remote dataset (a single 4 GB Parquet file) and streams it to disk:
>  
> {code:java}
> # open a remote dataset (a single ~4 GB Parquet file) on a public MinIO bucket
> s3 <- arrow::s3_bucket("data", endpoint_override = "minio3.ecoforecast.org", anonymous = TRUE)
> df <- arrow::open_dataset(s3$path("waq_test"))
> # stream it back out to a local temporary directory
> arrow::write_dataset(df, tempfile())
>  {code}
> In 8.0.0, this operation peaks at roughly 10 GB of RAM use, which is already surprisingly high given that the whole file is only 4 GB on disk. On arrow 9.0.0, RAM use for the same operation approximately doubles, which is enough to trigger the OOM killer in several of our active production workflows.
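> One way to compare the two releases more precisely might be to record the Arrow memory pool's high-water mark around the write (a sketch; assumes the memory pool bindings exposed by recent arrow R releases):
> {code}
> pool <- arrow::default_memory_pool()
> arrow::write_dataset(df, tempfile())
> # peak and current bytes allocated from this pool (sketch; assumes the
> # $max_memory and $bytes_allocated bindings are available)
> pool$max_memory
> pool$bytes_allocated
> {code}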
>  
> Can the large RAM use increase introduced in 9.0.0 be avoided?  Could this operation use even less RAM than it does in the 8.0.0 release?  Is there something about this particular Parquet file that could explain the large RAM use? 
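> In case it helps narrow things down, possible mitigations to experiment with might include switching Arrow's allocator or reducing parallelism (a sketch; whether either actually lowers the peak here is an open question):
> {code}
> # ask Arrow to use the plain system allocator instead of jemalloc/mimalloc;
> # this must be set before the arrow package is loaded
> Sys.setenv(ARROW_DEFAULT_MEMORY_POOL = "system")
> library(arrow)
> # reduce CPU and I/O parallelism, which can reduce how many batches are in flight
> arrow::set_cpu_count(1)
> arrow::set_io_thread_count(2)
> {code}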
>  
> Arrow's impressively fast performance on large data on remote hosts is really game-changing for us.  Still, the OOM errors are a bit unexpected at this scale (i.e. a single 4 GB Parquet file); as R users we really depend on arrow's out-of-core operations to work with larger-than-RAM data.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)