Posted to jira@arrow.apache.org by "Dmitry Kashtanov (Jira)" <ji...@apache.org> on 2021/02/05 11:34:00 UTC

[jira] [Commented] (ARROW-11007) [Python] Memory leak in pq.read_table and table.to_pandas

    [ https://issues.apache.org/jira/browse/ARROW-11007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279642#comment-17279642 ] 

Dmitry Kashtanov commented on ARROW-11007:
------------------------------------------

I have a somewhat similar issue, observed with both {{pyarrow}} v1.0.0 and v3.0.0 on Linux (within Docker containers, {{python:3.8-slim}}-based image, local and AWS Fargate). The issue occurs when reading from BigQuery with the BigQuery Storage API using the ARROW data format. Under the hood it downloads a set of RecordBatches and combines them into a Table. After this, my code converts the Table to a pandas DataFrame and then deletes it, but the Table's memory is not released to the OS.

This behavior persists if I use the {{mimalloc}}- or {{system}}-based pools, set either in code or via the {{ARROW_DEFAULT_MEMORY_POOL}} environment variable.
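
For completeness, this is the kind of switch I mean (a minimal sketch; whether {{mimalloc}} is available depends on how pyarrow was built, and the environment variable has to be set before {{pyarrow}} is imported):
{code:python}
import pyarrow

# Pick an allocator explicitly instead of the default (jemalloc on Linux).
# mimalloc_memory_pool() raises NotImplementedError if pyarrow was built
# without mimalloc support.
pyarrow.set_memory_pool(pyarrow.mimalloc_memory_pool())
# ...or the plain malloc/free-based pool:
# pyarrow.set_memory_pool(pyarrow.system_memory_pool())

print(pyarrow.default_memory_pool().backend_name)
{code}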

Also, if I then drop a referenced (not copied) column from that pandas DataFrame, the drop triggers a copy of the DataFrame's data, and the memory of the original DataFrame is likewise not released to the OS. Subsequent transformations of the DataFrame release memory as expected.
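
To illustrate what I mean by a referenced vs. copied column (a minimal sketch; the exact copy behavior depends on pandas' internal block layout):
{code:python}
import numpy as np
import pandas as pd

df = pd.DataFrame({"label": np.zeros(1_000_000), "x": np.zeros(1_000_000)})

label_ref = df["label"]                   # references df's internal block, no copy
label_copy = df["label"].astype(np.int8)  # astype() materializes a real copy

# While `label_ref` is alive it keeps the original block pinned, so the
# rebuild done by the inplace drop leaves two copies of the data in memory.
df.drop(columns="label", inplace=True)
{code}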

The exact same code with the exact same Python (3.8.7) and package versions on macOS releases memory to the OS as expected (also with all kinds of memory pools).

 

The very first lines of the script are:
{code:python}
import pyarrow
pyarrow.jemalloc_set_decay_ms(0)
{code}
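
The function being profiled below is essentially the following (reconstructed from the profiler output; the read session and stream are created elsewhere):
{code:python}
import time
import pyarrow
from google.cloud import bigquery_storage as bqs

def bqs_stream_to_pandas(session, stream_name):
    client = bqs.BigQueryReadClient()
    reader = client.read_rows(name=stream_name, offset=0)
    table = reader.to_arrow(session)          # RecordBatches combined into a Table
    dataset = table.to_pandas(deduplicate_objects=False, split_blocks=False,
                              self_destruct=False, strings_to_categorical=True)
    del table, reader, client
    time.sleep(1)
    return dataset
{code}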
 

macOS:

 
{code:java}
Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
   460    141.5 MiB    141.5 MiB           1   @profile
   461                                         def bqs_stream_to_pandas(session, stream_name):
   463    142.2 MiB      0.7 MiB           1       client = bqs.BigQueryReadClient()
   464    158.7 MiB     16.5 MiB           1       reader = client.read_rows(name=stream_name, offset=0)
   465   1092.2 MiB    933.5 MiB           1       table = reader.to_arrow(session)
   470   2725.1 MiB   1632.5 MiB           2       dataset = table.to_pandas(deduplicate_objects=False, split_blocks=False, self_destruct=False,
   471   1092.6 MiB      0.0 MiB           1                                 strings_to_categorical=True,)
   472   1405.0 MiB  -1320.1 MiB           1       del table
   473   1405.0 MiB      0.0 MiB           1       del reader
   474   1396.1 MiB     -8.9 MiB           1       del client
   475   1396.1 MiB      0.0 MiB           1       time.sleep(1)
   476   1396.1 MiB      0.0 MiB           1       if MEM_PROFILING:
   477   1396.1 MiB      0.0 MiB           1           mem_pool = pyarrow.default_memory_pool()
   478   1396.1 MiB      0.0 MiB           1           print(f"PyArrow mem pool info: {mem_pool.backend_name} backend, {mem_pool.bytes_allocated()} allocated, "
   479                                                       f"{mem_pool.max_memory()} max allocated, ")
   480   1396.1 MiB      0.0 MiB           1           print(f"PyArrow total allocated bytes: {pyarrow.total_allocated_bytes()}")
   481   1402.4 MiB      6.3 MiB           1       mem_usage = dataset.memory_usage(index=True, deep=True)
   485   1404.2 MiB      0.0 MiB           1       return dataset

# Output
PyArrow mem pool info: jemalloc backend, 1313930816 allocated, 1340417472 max allocated,
PyArrow total allocated bytes: 1313930816

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
...
   139   1477.7 MiB      0.4 MiB           1           dataset_label = dataset[label_column].astype(np.int8)
   140
   141   1474.2 MiB     -3.5 MiB           1           dataset.drop(columns=label_column, inplace=True)
   142   1474.2 MiB      0.0 MiB           1           gc.collect()
   143
   144   1474.2 MiB      0.0 MiB           1           if MEM_PROFILING:
   145   1474.2 MiB      0.0 MiB           1               mem_pool = pyarrow.default_memory_pool()
   146   1474.2 MiB      0.0 MiB           1               print(f"PyArrow mem pool info: {mem_pool.backend_name} backend, {mem_pool.bytes_allocated()} allocated, "
   147                                                           f"{mem_pool.max_memory()} max allocated, ")
   148   1474.2 MiB      0.0 MiB           1               print(f"PyArrow total allocated bytes: {pyarrow.total_allocated_bytes()}")

# Output
PyArrow mem pool info: jemalloc backend, 0 allocated, 1340417472 max allocated,
PyArrow total allocated bytes: 0
{code}
 

 

Linux ({{python:3.8-slim}}-based image):

 
{code:java}
Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
   460    153.0 MiB    153.0 MiB           1   @profile
   461                                         def bqs_stream_to_pandas(session, stream_name):
   463    153.5 MiB      0.6 MiB           1       client = bqs.BigQueryReadClient()
   464    166.9 MiB     13.4 MiB           1       reader = client.read_rows(name=stream_name, offset=0)
   465   1567.5 MiB   1400.6 MiB           1       table = reader.to_arrow(session)
   469   1567.5 MiB      0.0 MiB           1       report_metric('piano.ml.preproc.pyarrow.table.bytes', table.nbytes)
   470   2843.7 MiB   1276.2 MiB           2       dataset = table.to_pandas(deduplicate_objects=False, split_blocks=False, self_destruct=False,
   471   1567.5 MiB      0.0 MiB           1                                 strings_to_categorical=True,)
   472   2843.7 MiB      0.0 MiB           1       del table
   473   2843.7 MiB      0.0 MiB           1       del reader
   474   2843.9 MiB      0.2 MiB           1       del client
   475   2842.2 MiB     -1.8 MiB           1       time.sleep(1)
   476   2842.2 MiB      0.0 MiB           1       if MEM_PROFILING:
   477   2842.2 MiB      0.0 MiB           1           mem_pool = pyarrow.default_memory_pool()
   478   2842.2 MiB      0.0 MiB           1           print(f"PyArrow mem pool info: {mem_pool.backend_name} backend, {mem_pool.bytes_allocated()} allocated, "
   479                                                       f"{mem_pool.max_memory()} max allocated, ")
   480   2842.2 MiB      0.0 MiB           1           print(f"PyArrow total allocated bytes: {pyarrow.total_allocated_bytes()}")
   481   2838.9 MiB     -3.3 MiB           1           mem_usage = dataset.memory_usage(index=True, deep=True)
   485   2839.1 MiB      0.0 MiB           1       return dataset

# Output
PyArrow mem pool info: jemalloc backend, 1313930816 allocated, 1338112064 max allocated,
PyArrow total allocated bytes: 1313930816

Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
...
   139   2839.1 MiB      0.0 MiB           1           dataset_label = dataset[label_column].astype(np.int8)
   140
   141   2836.6 MiB     -2.6 MiB           1           dataset.drop(columns=label_column, inplace=True)
   142   2836.6 MiB      0.0 MiB           1           gc.collect()
   143
   144   2836.6 MiB      0.0 MiB           1           if MEM_PROFILING:
   145   2836.6 MiB      0.0 MiB           1               mem_pool = pyarrow.default_memory_pool()
   146   2836.6 MiB      0.0 MiB           1               print(f"PyArrow mem pool info: {mem_pool.backend_name} backend, {mem_pool.bytes_allocated()} allocated, "
   147                                                           f"{mem_pool.max_memory()} max allocated, ")
   148   2836.6 MiB      0.0 MiB           1               print(f"PyArrow total allocated bytes: {pyarrow.total_allocated_bytes()}")

# Output
PyArrow mem pool info: jemalloc backend, 0 allocated, 1338112064 max allocated,
PyArrow total allocated bytes: 0
{code}
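
As a possible workaround on Linux, the glibc heap can be asked to shrink with {{malloc_trim}} (a sketch; this only affects memory allocated through glibc malloc, e.g. the {{system}} pool, not jemalloc's own arenas):
{code:python}
import ctypes

libc = ctypes.CDLL("libc.so.6")
libc.malloc_trim(0)  # ask glibc to return free heap pages to the OS
{code}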
 

 

A case with dropping a referenced (not copied) column:

 
{code:java}
Line #    Mem usage    Increment  Occurences   Line Contents
============================================================
...
   134   2872.0 MiB      0.0 MiB           1           dataset_label = dataset[label_column]
   135
   136   4039.4 MiB   1167.4 MiB           1           dataset.drop(columns=label_column, inplace=True)
   137   4035.9 MiB     -3.6 MiB           1               gc.collect()
   138
   139   4035.9 MiB      0.0 MiB           1           if MEM_PROFILING:
   140   4035.9 MiB      0.0 MiB           1               mem_pool = pyarrow.default_memory_pool()
   141   4035.9 MiB      0.0 MiB           1               print(f"PyArrow mem pool info: {mem_pool.backend_name} backend, {mem_pool.bytes_allocated()} allocated, "
   142                                                           f"{mem_pool.max_memory()} max allocated, ")

# Output
PyArrow mem pool info: jemalloc backend, 90227904 allocated, 1340299200 max allocated,
{code}
 

 

Package versions:

 
{code:java}
boto3==1.17.1
botocore==1.20.1
cachetools==4.2.1
certifi==2020.12.5
cffi==1.14.4
chardet==4.0.0
google-api-core[grpc]==1.25.1
google-auth==1.25.0
google-cloud-bigquery-storage==2.2.1
google-cloud-bigquery==2.7.0
google-cloud-core==1.5.0
google-crc32c==1.1.2
google-resumable-media==1.2.0
googleapis-common-protos==1.52.0
grpcio==1.35.0
idna==2.10
jmespath==0.10.0
joblib==1.0.0
libcst==0.3.16
memory-profiler==0.58.0
mypy-extensions==0.4.3
numpy==1.20.0
pandas==1.2.1
proto-plus==1.13.0
protobuf==3.14.0
psutil==5.8.0
pyarrow==3.0.0
pyasn1-modules==0.2.8
pyasn1==0.4.8
pycparser==2.20
python-dateutil==2.8.1
pytz==2021.1
pyyaml==5.4.1
requests==2.25.1
rsa==4.7
s3transfer==0.3.4
scikit-learn==0.24.1
scipy==1.6.0
setuptools-scm==5.0.1
six==1.15.0
smart-open==4.1.2
threadpoolctl==2.1.0
typing-extensions==3.7.4.3
typing-inspect==0.6.0
unidecode==1.1.2
urllib3==1.26.3
{code}
 

 

> [Python] Memory leak in pq.read_table and table.to_pandas
> ---------------------------------------------------------
>
>                 Key: ARROW-11007
>                 URL: https://issues.apache.org/jira/browse/ARROW-11007
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Michael Peleshenko
>            Priority: Major
>
> While upgrading our application to use pyarrow 2.0.0 instead of 0.12.1, we observed a memory leak in the read_table and to_pandas methods. See below for sample code to reproduce it. Memory does not seem to be returned after deleting the table and df as it was in pyarrow 0.12.1.
> *Sample Code*
> {code:python}
> import io
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from memory_profiler import profile
>
> @profile
> def read_file(f):
>     table = pq.read_table(f)
>     df = table.to_pandas(strings_to_categorical=True)
>     del table
>     del df
>
> def main():
>     rows = 2000000
>     df = pd.DataFrame({
>         "string": ["test"] * rows,
>         "int": [5] * rows,
>         "float": [2.0] * rows,
>     })
>     table = pa.Table.from_pandas(df, preserve_index=False)
>     parquet_stream = io.BytesIO()
>     pq.write_table(table, parquet_stream)
>     for i in range(3):
>         parquet_stream.seek(0)
>         read_file(parquet_stream)
>
> if __name__ == '__main__':
>     main()
> {code}
> *Python 3.8.5 (conda), pyarrow 2.0.0 (pip), pandas 1.1.2 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    161.7 MiB    161.7 MiB           1   @profile
>     10                                         def read_file(f):
>     11    212.1 MiB     50.4 MiB           1       table = pq.read_table(f)
>     12    258.2 MiB     46.1 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    258.2 MiB      0.0 MiB           1       del table
>     14    256.3 MiB     -1.9 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    256.3 MiB    256.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    279.2 MiB     23.0 MiB           1       table = pq.read_table(f)
>     12    322.2 MiB     43.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    322.2 MiB      0.0 MiB           1       del table
>     14    320.3 MiB     -1.9 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    320.3 MiB    320.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    326.9 MiB      6.5 MiB           1       table = pq.read_table(f)
>     12    361.7 MiB     34.8 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    361.7 MiB      0.0 MiB           1       del table
>     14    359.8 MiB     -1.9 MiB           1       del df
> {code}
> *Python 3.5.6 (conda), pyarrow 0.12.1 (pip), pandas 0.24.1 (pip) Logs*
> {code:java}
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    138.4 MiB    138.4 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.2 MiB     47.8 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     33.0 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.7 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    139.3 MiB    139.3 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.8 MiB     47.5 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.5 MiB    -47.7 MiB           1       del table
>     14    139.1 MiB    -32.4 MiB           1       del df
> Filename: C:/run_pyarrow_memoy_leak_sample.py
> Line #    Mem usage    Increment  Occurences   Line Contents
> ============================================================
>      9    139.1 MiB    139.1 MiB           1   @profile
>     10                                         def read_file(f):
>     11    186.8 MiB     47.7 MiB           1       table = pq.read_table(f)
>     12    219.2 MiB     32.4 MiB           1       df = table.to_pandas(strings_to_categorical=True)
>     13    171.8 MiB    -47.5 MiB           1       del table
>     14    139.3 MiB    -32.4 MiB           1       del df
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)