Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2020/09/24 10:43:00 UTC

[jira] [Comment Edited] (ARROW-10052) [Python] Incrementally using ParquetWriter keeps data in memory (eventually running out of RAM for large datasets)

    [ https://issues.apache.org/jira/browse/ARROW-10052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201440#comment-17201440 ] 

Antoine Pitrou edited comment on ARROW-10052 at 9/24/20, 10:42 AM:
-------------------------------------------------------------------

I'm not sure there's anything surprising. Running this thing a bit (in debug mode!), I see that RSS usage grows by 500-1000 bytes for each column chunk (that is, each column in a row group).

This seems to be simply the Parquet file metadata accumulating before it can be written at the end (when the ParquetWriter is closed).  {{format::FileMetadata}} has a vector of {{format::RowGroup}} (one per row group). {{format::RowGroup}} has a vector of {{format::Column}} (one per column). Each {{format::Column}} holds non-trivial information: file name, column metadata (itself potentially large).
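
To make this concrete, here is a rough sketch (not from the reporter; the file names, schema and chunk counts are made up, and {{FileMetaData.num_row_groups}} / {{FileMetaData.serialized_size}} are as exposed by pyarrow) that writes the same total number of rows once as 1000 tiny row groups and once as a single large one, then compares the footer sizes:

    import pyarrow as pa
    import pyarrow.parquet as pq

    schema = pa.schema([('x', pa.int64()), ('y', pa.float64())])

    def write_in_chunks(path, rows_per_chunk, n_chunks):
        # Each write_table() call creates one row group whose footer
        # metadata has to be kept in memory until the writer is closed.
        with pq.ParquetWriter(path, schema) as writer:
            for _ in range(n_chunks):
                chunk = pa.table({'x': list(range(rows_per_chunk)),
                                  'y': [0.0] * rows_per_chunk},
                                 schema=schema)
                writer.write_table(chunk)

    # Same 100000 rows in total, but 1000 row groups vs. a single one.
    write_in_chunks('tiny_groups.parquet', rows_per_chunk=100, n_chunks=1000)
    write_in_chunks('large_groups.parquet', rows_per_chunk=100000, n_chunks=1)

    for path in ('tiny_groups.parquet', 'large_groups.parquet'):
        md = pq.ParquetFile(path).metadata
        print(path, md.num_row_groups, 'row groups,',
              md.serialized_size, 'bytes of footer metadata')

The footer of the first file should be much larger (roughly proportional to the number of row groups), and that footer is exactly the state the writer has to accumulate until it is closed.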

So, basically you should write only large row groups to Parquet files. Row groups of 100 rows at a time make the Parquet format completely inadequate for this; use at least 10000 or 100000 rows per row group, IMHO.
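
For illustration, here is a minimal sketch of how the loop from the report quoted below could buffer the yielded chunks and flush them every ~100000 rows ({{output_file}}, {{arrow_schema}} and {{function_that_yields_data}} are the reporter's names; the buffer threshold is arbitrary):

    import pyarrow as pa
    import pyarrow.parquet as pq

    ROWS_PER_GROUP = 100000  # target row group size, tune as needed

    with pq.ParquetWriter(output_file, arrow_schema,
                          compression='snappy') as writer:
        buffered = []
        buffered_rows = 0
        for rows_dataframe in function_that_yields_data():
            table = pa.Table.from_pydict(rows_dataframe, arrow_schema)
            buffered.append(table)
            buffered_rows += table.num_rows
            if buffered_rows >= ROWS_PER_GROUP:
                # Flush the accumulated chunks as one large row group and
                # drop the references so the buffered tables can be freed.
                writer.write_table(pa.concat_tables(buffered))
                buffered = []
                buffered_rows = 0
        if buffered:
            writer.write_table(pa.concat_tables(buffered))

That way the number of row groups, and hence the footer metadata held until close, grows with the number of flushed batches rather than with the number of yielded chunks.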


> [Python] Incrementally using ParquetWriter keeps data in memory (eventually running out of RAM for large datasets)
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-10052
>                 URL: https://issues.apache.org/jira/browse/ARROW-10052
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 1.0.1
>            Reporter: Niklas B
>            Priority: Minor
>
> This ticket refers to the discussion between [~emkornfield] and me on the mailing list: "Incrementally using ParquetWriter without keeping entire dataset in memory (large than memory parquet files)" (not yet available on the mail archives)
> Original post:
> {quote}Hi,
>  I'm trying to write a large parquet file onto disk (larger than memory) using PyArrow's ParquetWriter and write_table, but even though the file is written incrementally to disk, it still appears to keep the entire dataset in memory (eventually getting OOM killed). Basically what I am trying to do is:
>  import pyarrow as pa
>  import pyarrow.parquet as pq
>
>  with pq.ParquetWriter(
>          output_file,
>          arrow_schema,
>          compression='snappy',
>          allow_truncated_timestamps=True,
>          version='2.0',  # Highest available format version
>          data_page_version='2.0',  # Highest available data page version
>  ) as writer:
>      for rows_dataframe in function_that_yields_data():
>          writer.write_table(
>              pa.Table.from_pydict(
>                  rows_dataframe,
>                  arrow_schema
>              )
>          )
>  Where I have a function that yields data and then writes it in chunks using write_table. 
>  Is it possible to force the ParquetWriter to not keep the entire dataset in memory, or is it simply not possible for good reasons?
>  I’m streaming data from a database and writing it to Parquet. The end consumer has plenty of RAM, but the machine that does the conversion doesn’t. 
>  Regards,
>  Niklas
> {quote}
> Minimal example (I can't attach it as a file for some reason): [https://gist.github.com/bivald/2ddbc853ce8da9a9a064d8b56a93fc95]
> Looking at it now that I've made a minimal example, I see something I didn't realize before: while the memory usage is increasing, it doesn't appear to be linear in the amount of data written to the file. This indicates (I guess) that it isn't actually storing the written dataset, but something else. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)