Posted to jira@arrow.apache.org by "Stig Korsnes (Jira)" <ji...@apache.org> on 2022/03/11 20:27:00 UTC

[jira] [Updated] (ARROW-15920) Memory usage RecordBatchStreamWriter

     [ https://issues.apache.org/jira/browse/ARROW-15920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stig Korsnes updated ARROW-15920:
---------------------------------
    Attachment: mem.png

> Memory usage RecordBatchStreamWriter
> ------------------------------------
>
>                 Key: ARROW-15920
>                 URL: https://issues.apache.org/jira/browse/ARROW-15920
>             Project: Apache Arrow
>          Issue Type: Wish
>    Affects Versions: 7.0.0
>         Environment: Windows 11, Python 3.9.2
>            Reporter: Stig Korsnes
>            Priority: Major
>         Attachments: demo.py, mem.png
>
>
> Hi.
> I have a monte-carlo calculator that yields a couple of hundred Nx1 numpy arrays. I need to develop further functionality on it, and since that can't be done easily without access to the full set, I'm pursuing the route of exporting them. I found PyArrow and got excited. The first wall I hit was that the writer could not write "columns" (IPC). A Stack Overflow post and two weeks later, I'm writing my arrays to a single file with a single column using a stream writer, calling write_table with max_chunksize (write_batch has no such parameter). I then combine all the files into a single file by opening a reader for every file and reading batches, combining them into a single record batch, and writing that out. The whole idea is that I can later pull in parts of the complete set / all columns (which would fit in memory) and process further.
> Now, everything works, but following along in my task manager, I see that memory simply skyrockets when I write. I would expect memory consumption to stay around the size of my group batches and then some. The whole point of this exercise is having stuff fit in memory, and I cannot see how to achieve this. It makes me wonder if I'm a complete idiot when I read [efficiently-writing-and-reading-arrow-data|https://arrow.apache.org/docs/python/ipc.html#efficiently-writing-and-reading-arrow-data]: have I done something wrong, or am I looking at it wrong? I have attached a python file with a simple attempt. I have tried the file writers, using Tables instead of batches, and refactoring in all thinkable ways.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)