Posted to jira@arrow.apache.org by "Jared Lander (Jira)" <ji...@apache.org> on 2021/04/24 22:01:00 UTC
[jira] [Created] (ARROW-12529) Writing to Parquet from tibble Consumes Large Amount of Memory
Jared Lander created ARROW-12529:
------------------------------------
Summary: Writing to Parquet from tibble Consumes Large Amount of Memory
Key: ARROW-12529
URL: https://issues.apache.org/jira/browse/ARROW-12529
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Jared Lander
When writing a large `tibble` to a Parquet file, a large amount of memory is consumed. I first discovered this when using `targets::tar_read(obj)` to load an object that had been saved in the Parquet format. That particular object was an `sf` object with about 20 million rows and 26 columns. For a 5-6 GB object, memory use ballooned by 22 GB.
I wrote the following code to test this using a regular `tibble`, not `sf`. In this test, memory use increases dramatically when writing but not when reading, and I am still trying to work out why.
{code:java}
library(arrow)
library(dplyr)
library(lobstr)
library(tictoc)

n <- 10000000

system('free -m')
tic()
fake <- tibble(
  ID=seq(n),
  x=runif(n=n, min=-170, max=170),
  y=runif(n=n, min=-60, max=70),
  text1=sample(x=state.name, size=n, replace=TRUE),
  text2=sample(x=state.name, size=n, replace=TRUE),
  text3=sample(x=state.division, size=n, replace=TRUE),
  text4=sample(x=state.region, size=n, replace=TRUE),
  text5=sample(x=state.abb, size=n, replace=TRUE),
  num1=sample(x=state.center$x, size=n, replace=TRUE),
  num2=sample(x=state.center$y, size=n, replace=TRUE),
  num3=sample(x=state.area, size=n, replace=TRUE),
  Rand1=rnorm(n=n),
  Rand2=rnorm(n=n, mean=100, sd=3),
  Rand3=rbinom(n=n, size=10, prob=0.4)
)
toc()
system('free -m')

obj_size(fake)/1024/1024/1024

system('free -m')
tic()
write_parquet(fake, 'data/write_fake.parquet')
toc()
system('free -m')

system('free -m')
gc()
system('free -m')

system('free -m')
tic()
fake_parquet <- read_parquet('data/write_fake.parquet')
toc()
system('free -m')
obj_size(fake_parquet)/1024/1024/1024
{code}