Posted to jira@arrow.apache.org by "Zsolt Kegyes-Brassai (Jira)" <ji...@apache.org> on 2022/04/27 08:53:00 UTC

[jira] [Comment Edited] (ARROW-16320) Dataset re-partitioning consumes considerable amount of memory

    [ https://issues.apache.org/jira/browse/ARROW-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528661#comment-17528661 ] 

Zsolt Kegyes-Brassai edited comment on ARROW-16320 at 4/27/22 8:52 AM:
-----------------------------------------------------------------------

Hi [~westonpace] 

I tried to create a reproducible example.

In the first step I created a dummy dataset with nearly 100 M rows, having different column types and missing data.
When writing this dataset to a parquet file I realized that even {{write_parquet()}} consumes a large amount of memory which is not returned.

Here is the data generation part:

 
{code:java}
library(tidyverse)
n = 99e6 + as.integer(1e6 * runif(n = 1))
# n = 1000
a = 
  tibble(
    key1 = sample(datasets::state.abb, size = n, replace = TRUE),
    key2 = sample(datasets::state.name, size = n, replace = TRUE),
    subkey1 = sample(LETTERS, size = n, replace = TRUE),
    subkey2 = sample(letters, size = n, replace = TRUE),
    value1 = runif(n = n),
    value2 = as.integer(1000 * runif(n = n)),
    time = as.POSIXct(1e8 * runif(n = n), tz = "UTC", origin = "2020-01-01")
  ) |> 
  mutate(
    subkey1 = if_else(key1 %in% c("WA", "WV", "WI", "WY"), 
                      subkey1, NA_character_),
    subkey2 = if_else(key2 %in% c("Washington", "West Virginia", "Wisconsin", "Wyoming"), 
                      subkey2, NA_character_),
  )
lobstr::obj_size(a)
#> 5,177,583,640 B
{code}
and here is the memory utilization after the dataset creation:

!100m_1_create.jpg!
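
The screenshot reflects OS-level memory use; the same point can also be captured from inside R for reproducibility. A minimal sketch using {{gc()}} and {{lobstr}} (it only reports R's own heap, so the numbers will not match the screenshot exactly):
{code:java}
# In-session memory check (sketch): covers only R's heap, not the OS view
lobstr::mem_used()   # total memory currently used by R objects
gc()                 # run a garbage collection and print Ncells/Vcells usage
{code}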

and writing to an *{{rds}}* file
{code:java}
readr::write_rds(a, here::here("db", "test100m.rds")){code}
there is no visible increase in memory utilization

!100m_2_rds.jpg!

and writing to a *parquet* file
{code:java}
arrow::write_parquet(a, here::here("db", "test100m.parquet")){code}
there is a drastic increase in memory utilization (10.6 GB -> 15 GB), just for writing the file

!100m_3_parquet.jpg!
It looks like the memory consumed while writing the parquet file was not returned even after 15 minutes.
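
In case it helps the diagnosis, here is a small sketch of what I would check to see whether the retained memory is held by Arrow's C++ memory pool rather than by R itself (I am assuming, based on the arrow documentation, that {{default_memory_pool()}} reports the allocator's live and peak bytes, and that {{ARROW_DEFAULT_MEMORY_POOL}} selects the allocator):
{code:java}
# Sketch: inspect Arrow's memory pool after the write_parquet() call
pool <- arrow::default_memory_pool()
pool$backend_name     # e.g. "jemalloc" or "mimalloc"
pool$bytes_allocated  # bytes currently held by the Arrow allocator
pool$max_memory       # peak allocation seen by this pool

# Assumption: the allocator can be switched to the system one before arrow is
# loaded (e.g. via .Renviron), which may change how memory is returned to the OS
# Sys.setenv(ARROW_DEFAULT_MEMORY_POOL = "system")  # then restart R and re-run
{code}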

My biggest concern is that the ability to handle datasets larger than the available memory seems increasingly remote.
I consider this a critical bug, but it might be that it affects only me… as I don’t have the possibility to test elsewhere.



> Dataset re-partitioning consumes considerable amount of memory
> --------------------------------------------------------------
>
>                 Key: ARROW-16320
>                 URL: https://issues.apache.org/jira/browse/ARROW-16320
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 7.0.0
>            Reporter: Zsolt Kegyes-Brassai
>            Priority: Minor
>         Attachments: 100m_1_create.jpg, 100m_2_rds.jpg, 100m_3_parquet.jpg, Rgui_mem.jpg, Rstudio_env.jpg, Rstudio_mem.jpg
>
>
> A short background: I was trying to create a dataset from a big pile of csv files (a couple of hundred). In the first step the csv files were parsed and saved to parquet files because there were many inconsistencies between the csv files. In a subsequent step the dataset was re-partitioned using one column (code_key).
>  
> {code:java}
> new_dataset <- open_dataset(
>   temp_parquet_folder, 
>   format = "parquet",
>   unify_schemas = TRUE
>   )
> new_dataset |> 
>   group_by(code_key) |> 
>   write_dataset(
>     folder_repartitioned_dataset, 
>     format = "parquet"
>   )
> {code}
>  
> This re-partitioning consumed a considerable amount of memory (5 GB).
>  * Is this normal behavior, or a bug?
>  * Is there any rule of thumb to estimate the memory requirement for a dataset re-partitioning? (it’s important when scaling up this approach)
> The drawback is that this memory space is not freed up after the re-partitioning (I am using RStudio).
> {{gc()}} is useless in this situation, and there is no object associated with the re-partitioning in the {{R}} environment that could be removed from memory (using the {{rm()}} function).
>  * How can one regain the memory space used by re-partitioning?
> The rationale behind choosing dataset re-partitioning: if my understanding is correct, in the current arrow version appending is not supported when writing parquet files/datasets (the original csv files were partly partitioned according to a different variable).
> Can you recommend any better approach?
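> (A hedged sketch of what I would try next, assuming the row-group/file size arguments of {{write_dataset()}} are available in the installed arrow version; whether they actually bound the peak memory of the re-partitioning is an open question.)
> {code:java}
> new_dataset |> 
>   group_by(code_key) |> 
>   write_dataset(
>     folder_repartitioned_dataset, 
>     format = "parquet",
>     max_rows_per_file = 1e6,    # hypothetical cap on rows written per file
>     max_rows_per_group = 1e5    # hypothetical cap on rows per parquet row group
>   )
> {code}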


