Posted to jira@arrow.apache.org by "Niklas B (Jira)" <ji...@apache.org> on 2020/09/21 09:13:00 UTC

[jira] [Created] (ARROW-10052) [Python] Incrementally using ParquetWriter keeps data in memory (eventually running out of RAM for large datasets)

Niklas B created ARROW-10052:
--------------------------------

             Summary: [Python] Incrementally using ParquetWriter keeps data in memory (eventually running out of RAM for large datasets)
                 Key: ARROW-10052
                 URL: https://issues.apache.org/jira/browse/ARROW-10052
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 1.0.1
            Reporter: Niklas B


This ticket refers to the discussion between [~emkornfield] and me on the mailing list: "Incrementally using ParquetWriter without keeping entire dataset in memory (large than memory parquet files)" (not yet available in the mail archives).

Original post:
{quote}Hi,
 I'm trying to write a large parquet file onto disk (larger than memory) using PyArrow's ParquetWriter and write_table, but even though the file is written incrementally to disk it still appears to keep the entire dataset in memory (eventually getting OOM killed). Basically what I am trying to do is:
 import pyarrow as pa
 import pyarrow.parquet as pq

 with pq.ParquetWriter(
     output_file,
     arrow_schema,
     compression='snappy',
     allow_truncated_timestamps=True,
     version='2.0',  # Highest available format version
     data_page_version='2.0',  # Highest available data page version
 ) as writer:
     for rows_dataframe in function_that_yields_data():
         writer.write_table(
             pa.Table.from_pydict(rows_dataframe, arrow_schema)
         )
 Here I have a function that yields data, which I then write in chunks using write_table.
 Is it possible to force the ParquetWriter to not keep the entire dataset in memory, or is it simply not possible for good reasons?
 I'm streaming data from a database and writing it to Parquet. The end consumer has plenty of RAM, but the machine that does the conversion doesn't.
 Regards,
 Niklas
{quote}
Minimal example (I can't attach it as a file for some reason): [https://gist.github.com/bivald/2ddbc853ce8da9a9a064d8b56a93fc95]

Looking at it now that I've made a minimal example, I notice something I didn't see/realize before: while the memory usage keeps increasing, the increase doesn't appear to be linear in the amount of data written. This suggests (I guess) that it isn't actually the written dataset that is being kept in memory, but something else.
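
One rough way to probe that observation is to log PyArrow's own allocation counter next to the size of the file on disk after every write_table call. The sketch below is not the linked gist: the synthetic_chunks generator, the chunk sizes and the output path are made up for illustration, and pa.total_allocated_bytes() only tracks Arrow's default memory pool, so it is a proxy for (not a measurement of) the process RSS that eventually triggers the OOM kill.

{code:python}
import os

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq


# Hypothetical stand-in for the real data source: yields equally sized
# column dicts so every chunk contributes roughly the same amount of data.
def synthetic_chunks(n_chunks=200, rows_per_chunk=100_000):
    for _ in range(n_chunks):
        yield {
            "id": np.arange(rows_per_chunk, dtype=np.int64),
            "value": np.random.random(rows_per_chunk),
        }


schema = pa.schema([("id", pa.int64()), ("value", pa.float64())])
output_file = "/tmp/arrow_10052_check.parquet"  # illustrative path

with pq.ParquetWriter(output_file, schema, compression="snappy") as writer:
    for i, chunk in enumerate(synthetic_chunks()):
        writer.write_table(pa.Table.from_pydict(chunk, schema))
        # If the allocation counter keeps climbing while the file grows
        # linearly, whatever is accumulating is not the written data itself.
        print(
            f"chunk {i:3d}: "
            f"arrow allocated = {pa.total_allocated_bytes() / 2**20:.1f} MiB, "
            f"file size = {os.path.getsize(output_file) / 2**20:.1f} MiB"
        )
{code}

Per-chunk numbers from a run like this would show whether the growth tracks the number of write_table calls rather than the bytes written.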



--
This message was sent by Atlassian Jira
(v8.3.4#803005)