Posted to jira@arrow.apache.org by "shadowdsp (Jira)" <ji...@apache.org> on 2021/03/03 09:09:00 UTC

[jira] [Commented] (ARROW-11007) [Python] Memory leak in pq.read_table and table.to_pandas

    [ https://issues.apache.org/jira/browse/ARROW-11007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294410#comment-17294410 ] 

shadowdsp commented on ARROW-11007:
-----------------------------------

I am seeing a similar issue with nested data on Ubuntu 16.04 with pyarrow 3.0, even after calling `pa.jemalloc_set_decay_ms(0)`. Non-nested data releases memory as expected.

Here is my script:
{code:python}
import io
import pandas as pd
import pyarrow as pa
pa.jemalloc_set_decay_ms(0)
import pyarrow.parquet as pq
from memory_profiler import profile

@profile
def read_file(f):
    table = pq.read_table(f)
    df = table.to_pandas(strings_to_categorical=True)
    del table
    del df

def main():
    rows = 2000000
    df = pd.DataFrame({
        "string": [{"test": [1, 2], "test1": [3, 4]}] * rows,
        "int": [5] * rows,
        "float": [2.0] * rows,
    })
    table = pa.Table.from_pandas(df, preserve_index=False)
    parquet_stream = io.BytesIO()
    pq.write_table(table, parquet_stream)
    for i in range(3):
        parquet_stream.seek(0)
        read_file(parquet_stream)

if __name__ == '__main__':
    main()
{code}
Output:
{code:java}
Filename: memory_leak.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    14    329.5 MiB    329.5 MiB           1   @profile
    15                                         def read_file(f):
    16    424.4 MiB     94.9 MiB           1       table = pq.read_table(f)
    17   1356.6 MiB    932.2 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    18   1310.5 MiB    -46.1 MiB           1       del table
    19    606.7 MiB   -703.8 MiB           1       del df


Filename: memory_leak.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    14    606.7 MiB    606.7 MiB           1   @profile
    15                                         def read_file(f):
    16    714.9 MiB    108.3 MiB           1       table = pq.read_table(f)
    17   1720.8 MiB   1005.9 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    18   1674.5 MiB    -46.3 MiB           1       del table
    19    970.6 MiB   -703.8 MiB           1       del df


Filename: memory_leak.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    14    970.6 MiB    970.6 MiB           1   @profile
    15                                         def read_file(f):
    16   1079.6 MiB    109.0 MiB           1       table = pq.read_table(f)
    17   2085.5 MiB   1005.9 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    18   2039.2 MiB    -46.3 MiB           1       del table
    19   1335.3 MiB   -703.8 MiB           1       del df
{code}
The memory for `df` and `table` is not fully released in this case, even though both are explicitly deleted.
Package info:
{code:java}
                                                                                                      
▶ pip show pyarrow     
Name: pyarrow
Version: 3.0.0
Summary: Python library for Apache Arrow
Home-page: https://arrow.apache.org/
Author: None
Author-email: None
License: Apache License, Version 2.0
Location: 
Requires: numpy
Required-by: utify
                                                                                                    
▶ pip show pandas 
Name: pandas
Version: 1.2.1
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: None
Author-email: None
License: BSD
Location: 
Requires: python-dateutil, pytz, numpy
Required-by: utify, seaborn, fastparquet
{code}

> [Python] Memory leak in pq.read_table and table.to_pandas
> ---------------------------------------------------------
>
>                 Key: ARROW-11007
>                 URL: https://issues.apache.org/jira/browse/ARROW-11007
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Michael Peleshenko
>            Priority: Major
>
> While upgrading our application to use pyarrow 2.0.0 instead of 0.12.1, we observed a memory leak in the read_table and to_pandas methods. See below for sample code to reproduce it. Memory does not seem to be returned after deleting the table and df as it was in pyarrow 0.12.1.
> *Sample Code*
> {code:python}
> import io
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from memory_profiler import profile
> @profile
> def read_file(f):
>     table = pq.read_table(f)
>     df = table.to_pandas(strings_to_categorical=True)
>     del table
>     del df
> def main():
>     rows = 2000000
>     df = pd.DataFrame({
>         "string": ["test"] * rows,
>         "int": [5] * rows,
>         "float": [2.0] * rows,
>     })
>     table = pa.Table.from_pandas(df, preserve_index=False)
>     parquet_stream = io.BytesIO()
>     pq.write_table(table, parquet_stream)
>     for i in range(3):
>         parquet_stream.seek(0)
>         read_file(parquet_stream)
> if __name__ == '__main__':
>     main()
> {code}
> *Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    161.7 MiB    161.7 MiB           1   @profile
>     10                                         def read_file(f):
>     11    212.1 MiB     50.4 MiB           1       table = pq.read_table(f)
>     12    258.2 MiB     46.1 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    258.2 MiB      0.0 MiB           1       del table
>     14    256.3 MiB     -1.9 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    256.3 MiB    256.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    279.2 MiB     23.0 MiB           1       table = pq.read_table(f)
>     12    322.2 MiB     43.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    322.2 MiB      0.0 MiB           1       del table
>     14    320.3 MiB     -1.9 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    320.3 MiB    320.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    326.9 MiB      6.5 MiB           1       table = pq.read_table(f)
>     12    361.7 MiB     34.8 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    361.7 MiB      0.0 MiB           1       del table
>     14    359.8 MiB     -1.9 MiB           1       del df
> {code}
> *Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    138.4 MiB    138.4 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.2 MiB     47.8 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     33.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.7 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    139.3 MiB    139.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.8 MiB     47.5 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.5 MiB    -47.7 MiB           1       del table
>     14    139.1 MiB    -32.4 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    139.1 MiB    139.1 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.8 MiB     47.7 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.8 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)