Posted to issues@arrow.apache.org by "Zsolt Kegyes-Brassai (Jira)" <ji...@apache.org> on 2022/04/25 15:46:00 UTC
[jira] [Created] (ARROW-16320) Dataset re-partitioning consumes considerable amount of memory
Zsolt Kegyes-Brassai created ARROW-16320:
--------------------------------------------
Summary: Dataset re-partitioning consumes considerable amount of memory
Key: ARROW-16320
URL: https://issues.apache.org/jira/browse/ARROW-16320
Project: Apache Arrow
Issue Type: Improvement
Affects Versions: 7.0.0
Reporter: Zsolt Kegyes-Brassai
A short background: I was trying to create a dataset from a big pile of csv files (a couple of hundred). In a first step the csv files were parsed and saved to parquet files, because there were many inconsistencies between the csv files. In a subsequent step the dataset was re-partitioned using one column ({{code_key}}).
{code:r}
new_dataset <- open_dataset(
  temp_parquet_folder,
  format = "parquet",
  unify_schemas = TRUE
)

new_dataset |>
  group_by(code_key) |>
  write_dataset(
    folder_repartitioned_dataset,
    format = "parquet"
  )
{code}
This re-partitioning consumed a considerable amount of memory (5 GB).
* Is this normal behavior, or a bug?
* Is there any rule of thumb to estimate the memory requirement for a dataset re-partitioning? (it’s important when scaling up this approach)
The drawback is that this memory is not freed up after the re-partitioning finishes (I am using RStudio).
Calling {{gc()}} is useless in this situation, and there is no object associated with the re-partitioning in the {{R}} environment that could be removed from memory (using the {{rm()}} function).
* How can one regain the memory used by the re-partitioning?
The rationale behind choosing dataset re-partitioning: if my understanding is correct, in the current arrow version appending to existing parquet files/datasets is not supported. (The original csv files were partly partitioned according to a different variable.)
Can you recommend a better approach?
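One workaround I considered, as a sketch: instead of opening all the parquet files at once, process them in smaller batches and write each batch into the same partitioned directory. A unique {{basename_template}} per batch should keep earlier output files from being overwritten. The batch size and folder names below are placeholders, and whether this actually bounds peak memory in arrow 7.0.0 is an assumption on my part:

{code:r}
library(arrow)
library(dplyr)

# Placeholder paths; batch size of 50 files is an arbitrary choice.
files <- list.files(temp_parquet_folder, pattern = "\\.parquet$", full.names = TRUE)
batches <- split(files, ceiling(seq_along(files) / 50))

for (b in seq_along(batches)) {
  open_dataset(batches[[b]], format = "parquet", unify_schemas = TRUE) |>
    group_by(code_key) |>
    write_dataset(
      folder_repartitioned_dataset,
      format = "parquet",
      # unique template per batch so files from earlier batches survive;
      # arrow substitutes {i} with a per-file counter
      basename_template = paste0("batch-", b, "-part-{i}.parquet")
    )
}
{code}

The resulting dataset would contain more (smaller) files per {{code_key}} partition than a single-pass rewrite, which may or may not be acceptable.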
--
This message was sent by Atlassian Jira
(v8.20.7#820007)