You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2020/12/22 18:34:00 UTC
[jira] [Commented] (ARROW-11007) [Python] Memory leak in pq.read_table and table.to_pandas

    [ https://issues.apache.org/jira/browse/ARROW-11007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253676#comment-17253676 ] 

Weston Pace commented on ARROW-11007:
-------------------------------------

Hello, thank you for writing up this analysis.  Pyarrow uses jemalloc, a custom memory allocator which does its best to hold onto memory allocated from the OS (since this can be an expensive operation).  Unfortunately, this makes it difficult to track line by line memory usage with tools like memory_profiler.  There are a couple of options:

* You could use [https://arrow.apache.org/docs/python/generated/pyarrow.total_allocated_bytes.html#pyarrow.total_allocated_bytes] to track allocation instead of using memory_profiler (it might be interesting to see if there is a way to get memory_profile to use this function instead of kernel statistics).

* You can also put the following line at the top of your script, this will configure jemalloc to release memory immediately instead of holding on to it (this will likely have some performance implications):

pa.jemalloc_set_decay_ms(0)

 

The behavior you are seeing is pretty typical for jemalloc.  For further reading, in addition to reading up on jemalloc itself, I encourage you to take a look at these other issues for more discussions and examples of jemalloc behaviors:

 

https://issues.apache.org/jira/browse/ARROW-6910

https://issues.apache.org/jira/browse/ARROW-7305

 

I have run your test read 10,000 times and it seems that memory usage does predictably stabilize.  In addition, total_allocated_bytes is behaving exactly as expected.  So I do not believe there is any evidence of a memory leak in this script.

> [Python] Memory leak in pq.read_table and table.to_pandas
> ---------------------------------------------------------
>
>                 Key: ARROW-11007
>                 URL: https://issues.apache.org/jira/browse/ARROW-11007
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Michael Peleshenko
>            Priority: Major
>
> While upgrading our application to use pyarrow 2.0.0 instead of 0.12.1, we observed a memory leak in the read_table and to_pandas methods. See below for sample code to reproduce it. Memory does not seem to be returned after deleting the table and df as it was in pyarrow 0.12.1.
> *Sample Code*
> {code:python}
> import io
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from memory_profiler import profile
> @profile
> def read_file(f):
>     table = pq.read_table(f)
>     df = table.to_pandas(strings_to_categorical=True)
>     del table
>     del df
> def main():
>     rows = 2000000
>     df = pd.DataFrame({
>         "string": ["test"] * rows,
>         "int": [5] * rows,
>         "float": [2.0] * rows,
>     })
>     table = pa.Table.from_pandas(df, preserve_index=False)
>     parquet_stream = io.BytesIO()
>     pq.write_table(table, parquet_stream)
>     for i in range(3):
>         parquet_stream.seek(0)
>         read_file(parquet_stream)
> if __name__ == '__main__':
>     main()
> {code}
> *Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    161.7 MiB    161.7 MiB           1   @profile
>     10                                         def read_file(f):
>     11    212.1 MiB     50.4 MiB           1       table = pq.read_table(f)
>     12    258.2 MiB     46.1 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    258.2 MiB      0.0 MiB           1       del table
>     14    256.3 MiB     -1.9 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    256.3 MiB    256.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    279.2 MiB     23.0 MiB           1       table = pq.read_table(f)
>     12    322.2 MiB     43.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    322.2 MiB      0.0 MiB           1       del table
>     14    320.3 MiB     -1.9 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    320.3 MiB    320.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    326.9 MiB      6.5 MiB           1       table = pq.read_table(f)
>     12    361.7 MiB     34.8 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    361.7 MiB      0.0 MiB           1       del table
>     14    359.8 MiB     -1.9 MiB           1       del df
> {code}
> *Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    138.4 MiB    138.4 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.2 MiB     47.8 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     33.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.7 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    139.3 MiB    139.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.8 MiB     47.5 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.5 MiB    -47.7 MiB           1       del table
>     14    139.1 MiB    -32.4 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    139.1 MiB    139.1 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.8 MiB     47.7 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.8 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)