Posted to jira@arrow.apache.org by "Michael Peleshenko (Jira)" <ji...@apache.org> on 2020/12/22 16:45:00 UTC
[jira] [Created] (ARROW-11007) [Python] Memory leak in pq.read_table and table.to_pandas
Michael Peleshenko created ARROW-11007:
------------------------------------------
Summary: [Python] Memory leak in pq.read_table and table.to_pandas
Key: ARROW-11007
URL: https://issues.apache.org/jira/browse/ARROW-11007
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 2.0.0
Reporter: Michael Peleshenko
While upgrading our application from pyarrow 0.12.1 to 2.0.0, we observed a memory leak in pq.read_table and table.to_pandas. The sample code below reproduces it. With pyarrow 2.0.0, memory is not returned after deleting the table and df, whereas with pyarrow 0.12.1 it was.
*Sample Code*
{code:python}
import io

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from memory_profiler import profile


@profile
def read_file(f):
    table = pq.read_table(f)
    df = table.to_pandas(strings_to_categorical=True)
    del table
    del df


def main():
    rows = 2000000
    df = pd.DataFrame({
        "string": ["test"] * rows,
        "int": [5] * rows,
        "float": [2.0] * rows,
    })
    table = pa.Table.from_pandas(df, preserve_index=False)
    parquet_stream = io.BytesIO()
    pq.write_table(table, parquet_stream)
    for i in range(3):
        parquet_stream.seek(0)
        read_file(parquet_stream)


if __name__ == '__main__':
    main()
{code}
*Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip) Logs*
{code:java}
Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    161.7 MiB    161.7 MiB           1   @profile
    10                                         def read_file(f):
    11    212.1 MiB     50.4 MiB           1       table = pq.read_table(f)
    12    258.2 MiB     46.1 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    258.2 MiB      0.0 MiB           1       del table
    14    256.3 MiB     -1.9 MiB           1       del df

Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    256.3 MiB    256.3 MiB           1   @profile
    10                                         def read_file(f):
    11    279.2 MiB     23.0 MiB           1       table = pq.read_table(f)
    12    322.2 MiB     43.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    322.2 MiB      0.0 MiB           1       del table
    14    320.3 MiB     -1.9 MiB           1       del df

Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    320.3 MiB    320.3 MiB           1   @profile
    10                                         def read_file(f):
    11    326.9 MiB      6.5 MiB           1       table = pq.read_table(f)
    12    361.7 MiB     34.8 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    361.7 MiB      0.0 MiB           1       del table
    14    359.8 MiB     -1.9 MiB           1       del df
{code}
*Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip) Logs*
{code:java}
Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    138.4 MiB    138.4 MiB           1   @profile
    10                                         def read_file(f):
    11    186.2 MiB     47.8 MiB           1       table = pq.read_table(f)
    12    219.2 MiB     33.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    171.7 MiB    -47.5 MiB           1       del table
    14    139.3 MiB    -32.4 MiB           1       del df

Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    139.3 MiB    139.3 MiB           1   @profile
    10                                         def read_file(f):
    11    186.8 MiB     47.5 MiB           1       table = pq.read_table(f)
    12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    171.5 MiB    -47.7 MiB           1       del table
    14    139.1 MiB    -32.4 MiB           1       del df

Filename: C:/run_pyarrow_memoy_leak_sample.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
     9    139.1 MiB    139.1 MiB           1   @profile
    10                                         def read_file(f):
    11    186.8 MiB     47.7 MiB           1       table = pq.read_table(f)
    12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    13    171.8 MiB    -47.5 MiB           1       del table
    14    139.3 MiB    -32.4 MiB           1       del df
{code}
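To make the comparison between the two sets of logs easier to read, here is a small sketch that computes how much memory each read_file call appears to retain. The readings are the "Mem usage" values from the first (@profile) and last (del df) profiled lines of each run above. The names READINGS, retained_per_call and summarize are illustrative helpers only and are not part of pyarrow or memory_profiler.

```python
# Start/end "Mem usage" readings (MiB) for each read_file call, copied from
# the @profile line and the "del df" line of each run in the logs above.
READINGS = {
    "pyarrow 2.0.0": [(161.7, 256.3), (256.3, 320.3), (320.3, 359.8)],
    "pyarrow 0.12.1": [(138.4, 139.3), (139.3, 139.1), (139.1, 139.3)],
}


def retained_per_call(readings):
    """Return the net memory change (MiB) across each profiled call."""
    return [round(end - start, 1) for start, end in readings]


def summarize(all_readings):
    """Map each version label to its list of per-call net changes."""
    return {version: retained_per_call(r) for version, r in all_readings.items()}


if __name__ == "__main__":
    for version, deltas in summarize(READINGS).items():
        total = round(sum(deltas), 1)
        print(f"{version}: per-call {deltas} MiB, total {total} MiB")
```

By this arithmetic, the three calls under pyarrow 2.0.0 retain roughly 94.6, 64.0 and 39.5 MiB, while under pyarrow 0.12.1 each call ends within about 1 MiB of where it started. These figures are process-level readings from memory_profiler, so they do not by themselves show which allocator is holding the memory.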