Posted to jira@arrow.apache.org by "Dmitry Kashtanov (Jira)" <ji...@apache.org> on 2021/02/05 11:34:00 UTC
[jira] [Commented] (ARROW-11007) [Python] Memory leak in pq.read_table and table.to_pandas
[ https://issues.apache.org/jira/browse/ARROW-11007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279642#comment-17279642 ]
Dmitry Kashtanov commented on ARROW-11007:
------------------------------------------
I have a somewhat similar issue, observed with both {{pyarrow}} v1.0.0 and v3.0.0 on Linux (within Docker containers based on the {{python:3.8-slim}} image, both locally and on AWS Fargate). The issue occurs when reading from BigQuery with the BigQuery Storage API using the Arrow data format. Under the hood it downloads a set of RecordBatches and combines them into a Table. My code then converts the Table to a pandas DataFrame and deletes the Table, but the Table's memory is not released to the OS.
This behavior persists if I use the {{mimalloc}}- or {{system}}-based memory pools, set either in code or via the {{ARROW_DEFAULT_MEMORY_POOL}} environment variable.
Furthermore, when I then drop a referenced (not copied) column from that pandas DataFrame, the DataFrame's data is copied and the memory of the original DataFrame is also not released to the OS. Subsequent transformations of the DataFrame release memory as expected.
The exact same code with the exact same Python version (3.8.7) and package versions on macOS releases memory to the OS as expected (with all memory pool backends).
The very first lines of the script are:
{code:python}
import pyarrow
pyarrow.jemalloc_set_decay_ms(0)
{code}
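For reference, the pool selection mentioned above can be sketched as follows. This is my own minimal illustration of the {{pyarrow}} memory-pool API, not the original BigQuery code; the buffer size is arbitrary:
{code:python}
import pyarrow as pa

# Inspect the pool pyarrow picked by default (jemalloc on most Linux
# builds, unless ARROW_DEFAULT_MEMORY_POOL says otherwise).
print(pa.default_memory_pool().backend_name)

# Ask jemalloc to return dirty pages to the OS immediately instead of
# keeping them cached (decay time of 0 ms); not all builds ship jemalloc.
try:
    pa.jemalloc_set_decay_ms(0)
except Exception:
    pass  # this pyarrow build has no jemalloc support

# Switch the process-wide default to the system allocator;
# pa.mimalloc_memory_pool() works the same way on builds that include it.
pa.set_memory_pool(pa.system_memory_pool())

# Allocations made through pyarrow are now tracked by the system pool.
buf = pa.allocate_buffer(1 << 20)
print(pa.default_memory_pool().backend_name)
{code}
Equivalently, setting {{ARROW_DEFAULT_MEMORY_POOL=system}} (or {{mimalloc}}) before the process starts selects the backend without code changes.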
macOS:
{code:java}
Line # Mem usage Increment Occurences Line Contents
============================================================
460 141.5 MiB 141.5 MiB 1 @profile
461 def bqs_stream_to_pandas(session, stream_name):
463 142.2 MiB 0.7 MiB 1 client = bqs.BigQueryReadClient()
464 158.7 MiB 16.5 MiB 1 reader = client.read_rows(name=stream_name, offset=0)
465 1092.2 MiB 933.5 MiB 1 table = reader.to_arrow(session)
470 2725.1 MiB 1632.5 MiB 2 dataset = table.to_pandas(deduplicate_objects=False, split_blocks=False, self_destruct=False,
471 1092.6 MiB 0.0 MiB 1 strings_to_categorical=True,)
472 1405.0 MiB -1320.1 MiB 1 del table
473 1405.0 MiB 0.0 MiB 1 del reader
474 1396.1 MiB -8.9 MiB 1 del client
475 1396.1 MiB 0.0 MiB 1 time.sleep(1)
476 1396.1 MiB 0.0 MiB 1 if MEM_PROFILING:
477 1396.1 MiB 0.0 MiB 1 mem_pool = pyarrow.default_memory_pool()
478 1396.1 MiB 0.0 MiB 1 print(f"PyArrow mem pool info: {mem_pool.backend_name} backend, {mem_pool.bytes_allocated()} allocated, "
479 f"{mem_pool.max_memory()} max allocated, ")
480 1396.1 MiB 0.0 MiB 1 print(f"PyArrow total allocated bytes: {pyarrow.total_allocated_bytes()}")
481 1402.4 MiB 6.3 MiB 1 mem_usage = dataset.memory_usage(index=True, deep=True)
485 1404.2 MiB 0.0 MiB 1 return dataset
# Output
PyArrow mem pool info: jemalloc backend, 1313930816 allocated, 1340417472 max allocated,
PyArrow total allocated bytes: 1313930816
Line # Mem usage Increment Occurences Line Contents
============================================================
...
139 1477.7 MiB 0.4 MiB 1 dataset_label = dataset[label_column].astype(np.int8)
140
141 1474.2 MiB -3.5 MiB 1 dataset.drop(columns=label_column, inplace=True)
142 1474.2 MiB 0.0 MiB 1 gc.collect()
143
144 1474.2 MiB 0.0 MiB 1 if MEM_PROFILING:
145 1474.2 MiB 0.0 MiB 1 mem_pool = pyarrow.default_memory_pool()
146 1474.2 MiB 0.0 MiB 1 print(f"PyArrow mem pool info: {mem_pool.backend_name} backend, {mem_pool.bytes_allocated()} allocated, "
147 f"{mem_pool.max_memory()} max allocated, ")
148 1474.2 MiB 0.0 MiB 1 print(f"PyArrow total allocated bytes: {pyarrow.total_allocated_bytes()}")
# Output
PyArrow mem pool info: jemalloc backend, 0 allocated, 1340417472 max allocated,
PyArrow total allocated bytes: 0
{code}
Linux ({{python:3.8-slim}}-based image):
{code:java}
Line # Mem usage Increment Occurences Line Contents
============================================================
460 153.0 MiB 153.0 MiB 1 @profile
461 def bqs_stream_to_pandas(session, stream_name):
463 153.5 MiB 0.6 MiB 1 client = bqs.BigQueryReadClient()
464 166.9 MiB 13.4 MiB 1 reader = client.read_rows(name=stream_name, offset=0)
465 1567.5 MiB 1400.6 MiB 1 table = reader.to_arrow(session)
469 1567.5 MiB 0.0 MiB 1 report_metric('piano.ml.preproc.pyarrow.table.bytes', table.nbytes)
470 2843.7 MiB 1276.2 MiB 2 dataset = table.to_pandas(deduplicate_objects=False, split_blocks=False, self_destruct=False,
471 1567.5 MiB 0.0 MiB 1 strings_to_categorical=True,)
472 2843.7 MiB 0.0 MiB 1 del table
473 2843.7 MiB 0.0 MiB 1 del reader
474 2843.9 MiB 0.2 MiB 1 del client
475 2842.2 MiB -1.8 MiB 1 time.sleep(1)
476 2842.2 MiB 0.0 MiB 1 if MEM_PROFILING:
477 2842.2 MiB 0.0 MiB 1 mem_pool = pyarrow.default_memory_pool()
478 2842.2 MiB 0.0 MiB 1 print(f"PyArrow mem pool info: {mem_pool.backend_name} backend, {mem_pool.bytes_allocated()} allocated, "
479 f"{mem_pool.max_memory()} max allocated, ")
480 2842.2 MiB 0.0 MiB 1 print(f"PyArrow total allocated bytes: {pyarrow.total_allocated_bytes()}")
481 2838.9 MiB -3.3 MiB 1 mem_usage = dataset.memory_usage(index=True, deep=True)
485 2839.1 MiB 0.0 MiB 1 return dataset
# Output
PyArrow mem pool info: jemalloc backend, 1313930816 allocated, 1338112064 max allocated,
PyArrow total allocated bytes: 1313930816
Line # Mem usage Increment Occurences Line Contents
============================================================
...
139 2839.1 MiB 0.0 MiB 1 dataset_label = dataset[label_column].astype(np.int8)
140
141 2836.6 MiB -2.6 MiB 1 dataset.drop(columns=label_column, inplace=True)
142 2836.6 MiB 0.0 MiB 1 gc.collect()
143
144 2836.6 MiB 0.0 MiB 1 if MEM_PROFILING:
145 2836.6 MiB 0.0 MiB 1 mem_pool = pyarrow.default_memory_pool()
146 2836.6 MiB 0.0 MiB 1 print(f"PyArrow mem pool info: {mem_pool.backend_name} backend, {mem_pool.bytes_allocated()} allocated, "
147 f"{mem_pool.max_memory()} max allocated, ")
148 2836.6 MiB 0.0 MiB 1 print(f"PyArrow total allocated bytes: {pyarrow.total_allocated_bytes()}")
# Output
PyArrow mem pool info: jemalloc backend, 0 allocated, 1338112064 max allocated,
PyArrow total allocated bytes: 0
{code}
A case with dropping a referenced (not copied) column:
{code:java}
Line # Mem usage Increment Occurences Line Contents
============================================================
...
134 2872.0 MiB 0.0 MiB 1 dataset_label = dataset[label_column]
135
136 4039.4 MiB 1167.4 MiB 1 dataset.drop(columns=label_column, inplace=True)
137 4035.9 MiB -3.6 MiB 1 gc.collect()
138
139 4035.9 MiB 0.0 MiB 1 if MEM_PROFILING:
140 4035.9 MiB 0.0 MiB 1 mem_pool = pyarrow.default_memory_pool()
141 4035.9 MiB 0.0 MiB 1 print(f"PyArrow mem pool info: {mem_pool.backend_name} backend, {mem_pool.bytes_allocated()} allocated, "
142 f"{mem_pool.max_memory()} max allocated, ")
# Output
PyArrow mem pool info: jemalloc backend, 90227904 allocated, 1340299200 max allocated,
{code}
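The drop-after-reference pattern above can be reproduced with a toy DataFrame. This sketch is my own illustration (not the original BigQuery code); it shows that holding a Series referencing a column forces {{drop(..., inplace=True)}} to copy the surviving data while the old block stays pinned by the reference:
{code:python}
import numpy as np
import pandas as pd

# Build a DataFrame whose two int64 columns live in one shared block.
df = pd.DataFrame({"label": np.zeros(1_000_000, dtype=np.int64),
                   "feature": np.ones(1_000_000, dtype=np.int64)})

# Take a reference to a column: no copy yet, the Series shares the
# DataFrame's underlying block memory.
label = df["label"]
assert np.shares_memory(label.to_numpy(), df["label"].to_numpy())

# Dropping the referenced column in place makes pandas reassemble its
# blocks: the remaining columns are copied, while the original block is
# kept alive by `label`, so peak memory roughly doubles.
df.drop(columns="label", inplace=True)

# The reference keeps its data even though the column is gone from df.
print(label.iloc[0], list(df.columns))
{code}
Taking an explicit copy ({{df[label_column].copy()}}) before the drop avoids pinning the original block, at the cost of one up-front column copy.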
Package versions:
{code:java}
boto3==1.17.1
botocore==1.20.1
cachetools==4.2.1
certifi==2020.12.5
cffi==1.14.4
chardet==4.0.0
google-api-core[grpc]==1.25.1
google-auth==1.25.0
google-cloud-bigquery-storage==2.2.1
google-cloud-bigquery==2.7.0
google-cloud-core==1.5.0
google-crc32c==1.1.2
google-resumable-media==1.2.0
googleapis-common-protos==1.52.0
grpcio==1.35.0
idna==2.10
jmespath==0.10.0
joblib==1.0.0
libcst==0.3.16
memory-profiler==0.58.0
mypy-extensions==0.4.3
numpy==1.20.0
pandas==1.2.1
proto-plus==1.13.0
protobuf==3.14.0
psutil==5.8.0
pyarrow==3.0.0
pyasn1-modules==0.2.8
pyasn1==0.4.8
pycparser==2.20
python-dateutil==2.8.1
pytz==2021.1
pyyaml==5.4.1
requests==2.25.1
rsa==4.7
s3transfer==0.3.4
scikit-learn==0.24.1
scipy==1.6.0
setuptools-scm==5.0.1
six==1.15.0
smart-open==4.1.2
threadpoolctl==2.1.0
typing-extensions==3.7.4.3
typing-inspect==0.6.0
unidecode==1.1.2
urllib3==1.26.3
{code}
> [Python] Memory leak in pq.read_table and table.to_pandas
> ---------------------------------------------------------
>
> Key: ARROW-11007
> URL: https://issues.apache.org/jira/browse/ARROW-11007
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.0
> Reporter: Michael Peleshenko
> Priority: Major
>
> While upgrading our application to use pyarrow 2.0.0 instead of 0.12.1, we observed a memory leak in the read_table and to_pandas methods. See below for sample code to reproduce it. Memory does not seem to be returned after deleting the table and df as it was in pyarrow 0.12.1.
> *Sample Code*
> {code:python}
> import io
>
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from memory_profiler import profile
>
>
> @profile
> def read_file(f):
>     table = pq.read_table(f)
>     df = table.to_pandas(strings_to_categorical=True)
>     del table
>     del df
>
>
> def main():
>     rows = 2000000
>     df = pd.DataFrame({
>         "string": ["test"] * rows,
>         "int": [5] * rows,
>         "float": [2.0] * rows,
>     })
>     table = pa.Table.from_pandas(df, preserve_index=False)
>     parquet_stream = io.BytesIO()
>     pq.write_table(table, parquet_stream)
>     for i in range(3):
>         parquet_stream.seek(0)
>         read_file(parquet_stream)
>
>
> if __name__ == '__main__':
>     main()
> {code}
> *Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line # Mem usage Increment Occurences Line Contents
> ============================================================
> 9 161.7 MiB 161.7 MiB 1 @profile
> 10 def read_file(f):
> 11 212.1 MiB 50.4 MiB 1 table = pq.read_table(f)
> 12 258.2 MiB 46.1 MiB 1 df = table.to_pandas(strings_to_categorical=True)
> 13 258.2 MiB 0.0 MiB 1 del table
> 14 256.3 MiB -1.9 MiB 1 del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line # Mem usage Increment Occurences Line Contents
> ============================================================
> 9 256.3 MiB 256.3 MiB 1 @profile
> 10 def read_file(f):
> 11 279.2 MiB 23.0 MiB 1 table = pq.read_table(f)
> 12 322.2 MiB 43.0 MiB 1 df = table.to_pandas(strings_to_categorical=True)
> 13 322.2 MiB 0.0 MiB 1 del table
> 14 320.3 MiB -1.9 MiB 1 del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line # Mem usage Increment Occurences Line Contents
> ============================================================
> 9 320.3 MiB 320.3 MiB 1 @profile
> 10 def read_file(f):
> 11 326.9 MiB 6.5 MiB 1 table = pq.read_table(f)
> 12 361.7 MiB 34.8 MiB 1 df = table.to_pandas(strings_to_categorical=True)
> 13 361.7 MiB 0.0 MiB 1 del table
> 14 359.8 MiB -1.9 MiB 1 del df
> {code}
> *Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line # Mem usage Increment Occurences Line Contents
> ============================================================
> 9 138.4 MiB 138.4 MiB 1 @profile
> 10 def read_file(f):
> 11 186.2 MiB 47.8 MiB 1 table = pq.read_table(f)
> 12 219.2 MiB 33.0 MiB 1 df = table.to_pandas(strings_to_categorical=True)
> 13 171.7 MiB -47.5 MiB 1 del table
> 14 139.3 MiB -32.4 MiB 1 del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line # Mem usage Increment Occurences Line Contents
> ============================================================
> 9 139.3 MiB 139.3 MiB 1 @profile
> 10 def read_file(f):
> 11 186.8 MiB 47.5 MiB 1 table = pq.read_table(f)
> 12 219.2 MiB 32.4 MiB 1 df = table.to_pandas(strings_to_categorical=True)
> 13 171.5 MiB -47.7 MiB 1 del table
> 14 139.1 MiB -32.4 MiB 1 del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line # Mem usage Increment Occurences Line Contents
> ============================================================
> 9 139.1 MiB 139.1 MiB 1 @profile
> 10 def read_file(f):
> 11 186.8 MiB 47.7 MiB 1 table = pq.read_table(f)
> 12 219.2 MiB 32.4 MiB 1 df = table.to_pandas(strings_to_categorical=True)
> 13 171.8 MiB -47.5 MiB 1 del table
> 14 139.3 MiB -32.4 MiB 1 del df
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)