Posted to issues@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2019/09/18 21:49:00 UTC

[jira] [Commented] (ARROW-5086) [Python] Space leak in ParquetFile.read_row_group()

    [ https://issues.apache.org/jira/browse/ARROW-5086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932871#comment-16932871 ] 

Wes McKinney commented on ARROW-5086:
-------------------------------------

I've been looking at this for about an hour. This is really strange. Here is the example code I'm using to investigate:

https://gist.github.com/wesm/27a1c65aa8329855ff80dd0157553fa5

Here is the output:

https://gist.github.com/wesm/8ad9f224b64862ca31c28183effa82b4
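
For reference, the investigation loop has roughly the following shape. This is a hedged sketch rather than the gist verbatim: the file path is a placeholder, and psutil is assumed for the RSS measurement:

    import gc

    import psutil
    import pyarrow as pa
    import pyarrow.parquet as pq

    proc = psutil.Process()
    reader = pq.ParquetFile("example.parquet")  # placeholder path
    for ix in range(reader.num_row_groups):
        table = reader.read_row_group(ix)
        del table
        gc.collect()
        # Arrow's pool reports the memory as freed, yet RSS keeps climbing
        print("pool bytes:", pa.total_allocated_bytes(),
              "rss:", proc.memory_info().rss)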

Weirdly, on each iteration RSS grows by about 8 MB, which is the amount of Arrow memory allocated in that iteration, even though the memory pool claims the memory is being released. But once the file reader object goes out of scope, the RSS is released in bulk.
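
The bulk release is visible the same way. A minimal sketch, again assuming psutil and a placeholder path:

    import psutil
    import pyarrow.parquet as pq

    proc = psutil.Process()
    reader = pq.ParquetFile("example.parquet")  # placeholder path
    for ix in range(reader.num_row_groups):
        reader.read_row_group(ix)
    print("rss with reader alive:", proc.memory_info().rss)
    del reader  # dropping the reader releases the retained memory in bulk
    print("rss after reader dropped:", proc.memory_info().rss)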

I suspect there is a rogue heap allocation someplace, but I haven't found it yet. I checked that the destructors in the various C++ objects are firing on each iteration, but no dice so far.

> [Python] Space leak in ParquetFile.read_row_group()
> ---------------------------------------------------
>
>                 Key: ARROW-5086
>                 URL: https://issues.apache.org/jira/browse/ARROW-5086
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.12.1
>            Reporter: Jakub Okoński
>            Assignee: Wes McKinney
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.15.0
>
>         Attachments: all.png
>
>
> I have a code pattern like this:
>  
> import pyarrow.parquet as pq
>
> reader = pq.ParquetFile(path)
> for ix in range(reader.num_row_groups):
>     table = reader.read_row_group(ix, columns=self._columns)
>     # operate on table
>  
> But it leaks memory over time, only releasing it when the reader object is collected. Here's a workaround:
>  
> num_row_groups = pq.ParquetFile(path).num_row_groups
> for ix in range(num_row_groups):
>     # open a fresh ParquetFile for each row group so nothing accumulates
>     table = pq.ParquetFile(path).read_row_group(ix, columns=self._columns)
>     # operate on table
>  
> This puts an upper bound on memory usage and is what I'd expect from the code. I also call gc.collect() at the end of every loop iteration.
>  
> I charted memory usage for a small benchmark that just copies a file one row group at a time, converting to pandas and back to Arrow on the write path. The black line is the first variant, using a single reader object; the blue line instantiates a fresh reader in every iteration.


