Posted to jira@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2021/04/24 22:02:00 UTC

[jira] [Updated] (ARROW-12529) [R] Writing to Parquet from tibble Consumes Large Amount of Memory

     [ https://issues.apache.org/jira/browse/ARROW-12529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-12529:
---------------------------------
    Summary: [R] Writing to Parquet from tibble Consumes Large Amount of Memory  (was: Writing to Parquet from tibble Consumes Large Amount of Memory)

> [R] Writing to Parquet from tibble Consumes Large Amount of Memory
> ------------------------------------------------------------------
>
>                 Key: ARROW-12529
>                 URL: https://issues.apache.org/jira/browse/ARROW-12529
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Jared Lander
>            Priority: Major
>
> When writing a large `tibble` to a parquet file, a large amount of memory is consumed. I first discovered this when using `targets::tar_read(obj)` to load in an object that had been saved in the parquet format. That particular object was an `sf` object with about 20 million rows and 26 columns. For a 5-6 GB object, memory ballooned by 22 GB.
> I wrote the following code to test this using a regular `tibble`, not `sf`. In this test memory increases dramatically when writing, but not when reading, which I'm still trying to figure out.
> {code:java}
> library(arrow)
> library(dplyr)
> library(lobstr)
> library(tictoc)
>
> n <- 10000000
>
> system('free -m')
> tic()
> fake <- tibble(
>     ID=seq(n),
>     x=runif(n=n, min=-170, max=170),
>     y=runif(n=n, min=-60, max=70),
>     text1=sample(x=state.name, size=n, replace=TRUE),
>     text2=sample(x=state.name, size=n, replace=TRUE),
>     text3=sample(x=state.division, size=n, replace=TRUE),
>     text4=sample(x=state.region, size=n, replace=TRUE),
>     text5=sample(x=state.abb, size=n, replace=TRUE),
>     num1=sample(x=state.center$x, size=n, replace=TRUE),
>     num2=sample(x=state.center$y, size=n, replace=TRUE),
>     num3=sample(x=state.area, size=n, replace=TRUE),
>     Rand1=rnorm(n=n),
>     Rand2=rnorm(n=n, mean=100, sd=3),
>     Rand3=rbinom(n=n, size=10, prob=0.4)
> )
> toc()
> system('free -m')
>
> obj_size(fake)/1024/1024/1024
>
> system('free -m')
> tic()
> write_parquet(fake, 'data/write_fake.parquet')
> toc()
> system('free -m')
>
> system('free -m')
> gc()
> system('free -m')
>
> system('free -m')
> tic()
> fake_parquet <- read_parquet('data/write_fake.parquet')
> toc()
> system('free -m')
> obj_size(fake_parquet)/1024/1024/1024
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)