Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/04/25 17:14:00 UTC

[jira] [Commented] (ARROW-16320) Dataset re-partitioning consumes considerable amount of memory

    [ https://issues.apache.org/jira/browse/ARROW-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527653#comment-17527653 ] 

Weston Pace commented on ARROW-16320:
-------------------------------------

How are you measuring memory used?

There is a known issue where scanning parquet uses more RAM than expected.  8.0.0 should behave a little more reliably.  Exactly how much RAM is expected depends on the structure of your input files.  I'm working on documenting that this week.  However, in general, I would estimate that a few GB of process RAM are needed for this operation.
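
If you want to see what Arrow itself has allocated (as opposed to what the OS reports for the whole process), a rough sketch like the following should work, assuming your version of the arrow R package exposes the {{default_memory_pool()}} accessor:

{code:r}
library(arrow)

# Inspect Arrow's own allocator (jemalloc by default on Linux).
pool <- default_memory_pool()
pool$backend_name      # e.g. "jemalloc"
pool$bytes_allocated   # bytes Arrow currently holds
pool$max_memory        # peak allocation observed so far
{code}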

I would not expect any process memory (e.g. RSS assigned to the process) to remain allocated after the operation.  If you are on Linux, we use jemalloc by default, and it is configured so that you might need to wait up to 1 second for all the memory to be returned to the OS.
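
If you want to verify that on Linux, here is a rough sketch (just reading RSS straight from {{/proc}}) you could run right after the write finishes:

{code:r}
# Rough sketch, Linux only: resident set size of the current R process, in kB.
rss_kb <- function() {
  status <- readLines("/proc/self/status")
  as.numeric(gsub("[^0-9]", "", grep("^VmRSS:", status, value = TRUE)))
}

rss_kb()      # right after write_dataset() returns
Sys.sleep(1)  # give jemalloc a moment to return freed pages to the OS
rss_kb()      # should have dropped back down
{code}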

If you are measuring RAM with a tool like Linux's {{free}} then I would also expect you would see a large (potentially all) chunk of RAM move from the {{free}} column into the {{buff/cache}} column.  That would persist even after the repartitioning is done.  However, that RAM should still be "available" RAM; this is just how the Linux disk cache works.  I'd like to add an option to do writes with direct I/O at some point, which would avoid this.
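
To illustrate which columns I mean (exact numbers will vary):

{code:r}
# Just an illustration: a large drop in the "free" column with a matching rise
# in "buff/cache" after the write is the Linux page cache at work, not memory
# leaked by the process; the "available" column is what actually matters.
system("free -h")
{code}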

> Dataset re-partitioning consumes considerable amount of memory
> --------------------------------------------------------------
>
>                 Key: ARROW-16320
>                 URL: https://issues.apache.org/jira/browse/ARROW-16320
>             Project: Apache Arrow
>          Issue Type: Improvement
>    Affects Versions: 7.0.0
>            Reporter: Zsolt Kegyes-Brassai
>            Priority: Minor
>
> A short background: I was trying to create a dataset from a big pile of csv files (a couple of hundred). In the first step the csv files were parsed and saved to parquet files, because there were many inconsistencies between the csv files. In a subsequent step the dataset was re-partitioned using one column (code_key).
>  
> {code:r}
> library(arrow)
> library(dplyr)
> new_dataset <- open_dataset(
>   temp_parquet_folder, 
>   format = "parquet",
>   unify_schemas = TRUE
>   )
> new_dataset |> 
>   group_by(code_key) |> 
>   write_dataset(
>     folder_repartitioned_dataset, 
>     format = "parquet"
>   )
> {code}
>  
> This re-partitioning consumed a considerable amount of memory (5 GB). 
>  * Is this normal behavior, or a bug?
>  * Is there any rule of thumb to estimate the memory requirement for dataset re-partitioning? (It’s important when scaling up this approach.)
> The drawback is that this memory space is not freed up after the re-partitioning (I am using RStudio).
> {{gc()}} is useless in this situation, and there is no object associated with the repartitioning in the {{R}} environment which could be removed from memory (using the {{rm()}} function).
>  * How can one regain the memory space used by the re-partitioning?
> The rationale behind choosing dataset re-partitioning: if my understanding is correct, appending is not supported in the current arrow version when writing parquet files/datasets. (The original csv files were partly partitioned according to a different variable.)
> Can you recommend a better approach?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)