Posted to jira@arrow.apache.org by "shadowdsp (Jira)" <ji...@apache.org> on 2021/03/03 09:09:00 UTC
[jira] [Commented] (ARROW-11007) [Python] Memory leak in pq.read_table and table.to_pandas
[ https://issues.apache.org/jira/browse/ARROW-11007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294410#comment-17294410 ]
shadowdsp commented on ARROW-11007:
-----------------------------------
I see a similar issue with `nested data` on Ubuntu 16.04 with pyarrow v3.0, even if I set `pa.jemalloc_set_decay_ms(0)`. `Non-nested data` works fine.
Here is my script:
{code:python}
import io
import pandas as pd
import pyarrow as pa
pa.jemalloc_set_decay_ms(0)
import pyarrow.parquet as pq
from memory_profiler import profile


@profile
def read_file(f):
    table = pq.read_table(f)
    df = table.to_pandas(strings_to_categorical=True)
    del table
    del df


def main():
    rows = 2000000
    df = pd.DataFrame({
        "string": [{"test": [1, 2], "test1": [3, 4]}] * rows,
        "int": [5] * rows,
        "float": [2.0] * rows,
    })
    table = pa.Table.from_pandas(df, preserve_index=False)
    parquet_stream = io.BytesIO()
    pq.write_table(table, parquet_stream)
    for i in range(3):
        parquet_stream.seek(0)
        read_file(parquet_stream)


if __name__ == '__main__':
    main()
{code}
Output:
{code:java}
Filename: memory_leak.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    14    329.5 MiB    329.5 MiB           1   @profile
    15                                         def read_file(f):
    16    424.4 MiB     94.9 MiB           1       table = pq.read_table(f)
    17   1356.6 MiB    932.2 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    18   1310.5 MiB    -46.1 MiB           1       del table
    19    606.7 MiB   -703.8 MiB           1       del df

Filename: memory_leak.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    14    606.7 MiB    606.7 MiB           1   @profile
    15                                         def read_file(f):
    16    714.9 MiB    108.3 MiB           1       table = pq.read_table(f)
    17   1720.8 MiB   1005.9 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    18   1674.5 MiB    -46.3 MiB           1       del table
    19    970.6 MiB   -703.8 MiB           1       del df

Filename: memory_leak.py

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
    14    970.6 MiB    970.6 MiB           1   @profile
    15                                         def read_file(f):
    16   1079.6 MiB    109.0 MiB           1       table = pq.read_table(f)
    17   2085.5 MiB   1005.9 MiB           1       df = table.to_pandas(strings_to_categorical=True)
    18   2039.2 MiB    -46.3 MiB           1       del table
    19   1335.3 MiB   -703.8 MiB           1       del df
{code}
The memory for `df` and `table` is not fully released back to the OS in this case.
Package info:
{code:java}
▶ pip show pyarrow
Name: pyarrow
Version: 3.0.0
Summary: Python library for Apache Arrow
Home-page: https://arrow.apache.org/
Author: None
Author-email: None
License: Apache License, Version 2.0
Location:
Requires: numpy
Required-by: utify
▶ pip show pandas
Name: pandas
Version: 1.2.1
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: None
Author-email: None
License: BSD
Location:
Requires: python-dateutil, pytz, numpy
Required-by: utify, seaborn, fastparquet
{code}
> [Python] Memory leak in pq.read_table and table.to_pandas
> ---------------------------------------------------------
>
> Key: ARROW-11007
> URL: https://issues.apache.org/jira/browse/ARROW-11007
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.0
> Reporter: Michael Peleshenko
> Priority: Major
>
> While upgrading our application to use pyarrow 2.0.0 instead of 0.12.1, we observed a memory leak in the read_table and to_pandas methods. See below for sample code to reproduce it. Memory does not seem to be returned after deleting the table and df as it was in pyarrow 0.12.1.
> *Sample Code*
> {code:python}
> import io
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from memory_profiler import profile
>
>
> @profile
> def read_file(f):
>     table = pq.read_table(f)
>     df = table.to_pandas(strings_to_categorical=True)
>     del table
>     del df
>
>
> def main():
>     rows = 2000000
>     df = pd.DataFrame({
>         "string": ["test"] * rows,
>         "int": [5] * rows,
>         "float": [2.0] * rows,
>     })
>     table = pa.Table.from_pandas(df, preserve_index=False)
>     parquet_stream = io.BytesIO()
>     pq.write_table(table, parquet_stream)
>     for i in range(3):
>         parquet_stream.seek(0)
>         read_file(parquet_stream)
>
>
> if __name__ == '__main__':
>     main()
> {code}
> *Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
>
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    161.7 MiB    161.7 MiB           1   @profile
>     10                                         def read_file(f):
>     11    212.1 MiB     50.4 MiB           1       table = pq.read_table(f)
>     12    258.2 MiB     46.1 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    258.2 MiB      0.0 MiB           1       del table
>     14    256.3 MiB     -1.9 MiB           1       del df
>
> Filename: C:/run_pyarrow_memoy_leak_sample.py
>
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    256.3 MiB    256.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    279.2 MiB     23.0 MiB           1       table = pq.read_table(f)
>     12    322.2 MiB     43.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    322.2 MiB      0.0 MiB           1       del table
>     14    320.3 MiB     -1.9 MiB           1       del df
>
> Filename: C:/run_pyarrow_memoy_leak_sample.py
>
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    320.3 MiB    320.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    326.9 MiB      6.5 MiB           1       table = pq.read_table(f)
>     12    361.7 MiB     34.8 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    361.7 MiB      0.0 MiB           1       del table
>     14    359.8 MiB     -1.9 MiB           1       del df
> {code}
> *Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
>
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    138.4 MiB    138.4 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.2 MiB     47.8 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     33.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.7 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
>
> Filename: C:/run_pyarrow_memoy_leak_sample.py
>
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    139.3 MiB    139.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.8 MiB     47.5 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.5 MiB    -47.7 MiB           1       del table
>     14    139.1 MiB    -32.4 MiB           1       del df
>
> Filename: C:/run_pyarrow_memoy_leak_sample.py
>
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    139.1 MiB    139.1 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.8 MiB     47.7 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.8 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)