Posted to jira@arrow.apache.org by "Michael Peleshenko (Jira)" <ji...@apache.org> on 2020/12/22 16:45:00 UTC

[jira] [Created] (ARROW-11007) [Python] Memory leak in pq.read_table and table.to_pandas

Michael Peleshenko created ARROW-11007:
------------------------------------------

             Summary: [Python] Memory leak in pq.read_table and table.to_pandas
                 Key: ARROW-11007
                 URL: https://issues.apache.org/jira/browse/ARROW-11007
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 2.0.0
            Reporter: Michael Peleshenko


While upgrading our application from pyarrow 0.12.1 to 2.0.0, we observed a memory leak in the pq.read_table and table.to_pandas methods. See below for sample code to reproduce it. Memory does not appear to be returned after deleting the table and the df, as it was with pyarrow 0.12.1.

*Sample Code*
{code:python}
import io

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from memory_profiler import profile


@profile
def read_file(f):
    table = pq.read_table(f)
    df = table.to_pandas(strings_to_categorical=True)
    del table
    del df


def main():
    rows = 2000000
    df = pd.DataFrame({
        "string": ["test"] * rows,
        "int": [5] * rows,
        "float": [2.0] * rows,
    })
    table = pa.Table.from_pandas(df, preserve_index=False)
    parquet_stream = io.BytesIO()
    pq.write_table(table, parquet_stream)

    for i in range(3):
        parquet_stream.seek(0)
        read_file(parquet_stream)


if __name__ == '__main__':
    main()
{code}
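To help distinguish memory that Arrow itself still holds from memory the allocator simply has not returned to the OS, the profiled function could additionally log Arrow's memory pool statistics. The following is only a diagnostic sketch of a variant of read_file above (the function name is ours), using the public pa.total_allocated_bytes() and pa.default_memory_pool() APIs; it is not part of the original reproduction.
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
from memory_profiler import profile


@profile
def read_file_with_pool_stats(f):
    # Bytes currently held by Arrow's default memory pool before the read.
    print("pool before:", pa.total_allocated_bytes())
    table = pq.read_table(f)
    df = table.to_pandas(strings_to_categorical=True)
    del table
    del df
    # If this returns to roughly the "before" value, Arrow has released its
    # buffers and the remaining RSS growth comes from the allocator or other caches.
    print("pool after: ", pa.total_allocated_bytes())
    print("pool peak:  ", pa.default_memory_pool().max_memory())
{code}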
*Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip) Logs*
{code}
Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    161.7 MiB    161.7 MiB           1   @profile
    10                                         def read_file(f):
    11    212.1 MiB     50.4 MiB           1       table = pq.read_table(f)
    12    258.2 MiB     46.1 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    258.2 MiB      0.0 MiB           1       del table
    14    256.3 MiB     -1.9 MiB           1       del df


Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    256.3 MiB    256.3 MiB           1   @profile
    10                                         def read_file(f):
    11    279.2 MiB     23.0 MiB           1       table = pq.read_table(f)
    12    322.2 MiB     43.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    322.2 MiB      0.0 MiB           1       del table
    14    320.3 MiB     -1.9 MiB           1       del df


Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    320.3 MiB    320.3 MiB           1   @profile
    10                                         def read_file(f):
    11    326.9 MiB      6.5 MiB           1       table = pq.read_table(f)
    12    361.7 MiB     34.8 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    361.7 MiB      0.0 MiB           1       del table
    14    359.8 MiB     -1.9 MiB           1       del df
{code}
*Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip) Logs*
{code}
Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    138.4 MiB    138.4 MiB           1   @profile
    10                                         def read_file(f):
    11    186.2 MiB     47.8 MiB           1       table = pq.read_table(f)
    12    219.2 MiB     33.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    171.7 MiB    -47.5 MiB           1       del table
    14    139.3 MiB    -32.4 MiB           1       del df


Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    139.3 MiB    139.3 MiB           1   @profile
    10                                         def read_file(f):
    11    186.8 MiB     47.5 MiB           1       table = pq.read_table(f)
    12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    171.5 MiB    -47.7 MiB           1       del table
    14    139.1 MiB    -32.4 MiB           1       del df


Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    139.1 MiB    139.1 MiB           1   @profile
    10                                         def read_file(f):
    11    186.8 MiB     47.7 MiB           1       table = pq.read_table(f)
    12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    171.8 MiB    -47.5 MiB           1       del table
    14    139.3 MiB    -32.4 MiB           1       del df
{code}
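If the growth above turns out to be allocator caching rather than a true leak, one possible cross-check (a sketch we have not verified, not a confirmed workaround) is to re-run the reproduction with Arrow's plain system allocator, either by setting the ARROW_DEFAULT_MEMORY_POOL=system environment variable before importing pyarrow, or programmatically:
{code:python}
import pyarrow as pa

# Route subsequent Arrow allocations through the system allocator instead of
# jemalloc/mimalloc, which may cache freed memory instead of returning it to the OS.
pa.set_memory_pool(pa.system_memory_pool())
{code}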


