Posted to jira@arrow.apache.org by "Jared Lander (Jira)" <ji...@apache.org> on 2021/04/24 22:01:00 UTC

[jira] [Created] (ARROW-12529) Writing to Parquet from tibble Consumes Large Amount of Memory

Jared Lander created ARROW-12529:
------------------------------------

             Summary: Writing to Parquet from tibble Consumes Large Amount of Memory
                 Key: ARROW-12529
                 URL: https://issues.apache.org/jira/browse/ARROW-12529
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 3.0.0
            Reporter: Jared Lander


When writing a large `tibble` to a parquet file, a large amount of memory is consumed. I first discovered this when using `targets::tar_read(obj)` to load in an object that had been saved in the parquet format. That particular object was an `sf` object with about 20 million rows and 26 columns. For a 5-6 GB object, memory ballooned by 22 GB.

I wrote the following code to test this using a regular `tibble`, not `sf`. In this test, memory increases dramatically when writing but not when reading, which I'm still trying to figure out.
{code:r}
library(arrow)
library(dplyr)
library(lobstr)
library(tictoc)

n <- 10000000

system('free -m')
tic()
fake <- tibble(
    ID=seq(n),
    x=runif(n=n, min=-170, max=170),
    y=runif(n=n, min=-60, max=70),
    text1=sample(x=state.name, size=n, replace=TRUE),
    text2=sample(x=state.name, size=n, replace=TRUE),
    text3=sample(x=state.division, size=n, replace=TRUE),
    text4=sample(x=state.region, size=n, replace=TRUE),
    text5=sample(x=state.abb, size=n, replace=TRUE),
    num1=sample(x=state.center$x, size=n, replace=TRUE),
    num2=sample(x=state.center$y, size=n, replace=TRUE),
    num3=sample(x=state.area, size=n, replace=TRUE),
    Rand1=rnorm(n=n),
    Rand2=rnorm(n=n, mean=100, sd=3),
    Rand3=rbinom(n=n, size=10, prob=0.4)
)
toc()
system('free -m')

obj_size(fake)/1024/1024/1024

system('free -m')
tic()
write_parquet(fake, 'data/write_fake.parquet')
toc()
system('free -m')

system('free -m')
gc()
system('free -m')

system('free -m')
tic()
fake_parquet <- read_parquet('data/write_fake.parquet')
toc()
system('free -m')
obj_size(fake_parquet)/1024/1024/1024

{code}
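As an aside, shelling out to {{free -m}} measures whole-system memory, so other processes can muddy the numbers. A minimal sketch (base R only, not part of the original repro) of measuring the allocation delta around a single step in-process with {{gc()}}; the {{runif()}} call is a hypothetical stand-in for the {{write_parquet()}} step:
{code:r}
# gc() returns a matrix whose second column is current "used (Mb)"
# for Ncells and Vcells combined; summing it gives R's own usage.
mem_used_mb <- function() sum(gc()[, 2])

before <- mem_used_mb()
x <- runif(1e6)          # stand-in for write_parquet(fake, ...)
after <- mem_used_mb()

cat(sprintf("delta: %.1f Mb\n", after - before))
{code}
This only captures allocations visible to R's garbage collector; memory allocated by Arrow's C++ layer would not appear here, which is itself a hint about where the extra 22 GB might live.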



--
This message was sent by Atlassian Jira
(v8.3.4#803005)