Posted to issues@arrow.apache.org by "Will Jones (Jira)" <ji...@apache.org> on 2022/08/16 20:49:00 UTC
[jira] [Created] (ARROW-17441) [Python] Memory kept after del and pool.release_unused()
Will Jones created ARROW-17441:
----------------------------------
Summary: [Python] Memory kept after del and pool.release_unused()
Key: ARROW-17441
URL: https://issues.apache.org/jira/browse/ARROW-17441
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 9.0.0
Reporter: Will Jones
I was trying to reproduce another issue involving memory pools not releasing memory, but encountered this confusing behavior: if I create a table, then call {{del table}}, and then {{pool.release_unused()}}, I still see significant memory usage. On mimalloc in particular, I see no meaningful drop in memory usage on either call.
Am I missing something?
{code:python}
import gc
import os
import time
from uuid import uuid4

import numpy as np
import psutil
import pyarrow as pa

process = psutil.Process(os.getpid())


def gen_batches(n_groups=200, rows_per_group=200_000):
    for _ in range(n_groups):
        id_val = uuid4().bytes
        yield pa.table({
            "x": np.random.random(rows_per_group),  # This will compress poorly
            "y": np.random.random(rows_per_group),
            "a": pa.array(list(range(rows_per_group)), type=pa.int32()),  # This compresses with delta encoding
            "id": pa.array([id_val] * rows_per_group),  # This compresses with RLE
        })


def print_rss():
    print(f"RSS: {process.memory_info().rss:,} bytes")

print(f"memory_pool={pa.default_memory_pool().backend_name}")
print_rss()
print("reading table")
tab = pa.concat_tables(list(gen_batches()))
print_rss()
print("deleting table")
del tab
gc.collect()
print_rss()
print("releasing unused memory")
pa.default_memory_pool().release_unused()
print_rss()
print("waiting 10 seconds")
time.sleep(10)
print_rss()
{code}
{code:none}
> ARROW_DEFAULT_MEMORY_POOL=mimalloc python test_pool.py && \
ARROW_DEFAULT_MEMORY_POOL=jemalloc python test_pool.py && \
ARROW_DEFAULT_MEMORY_POOL=system python test_pool.py
memory_pool=mimalloc
RSS: 44,449,792 bytes
reading table
RSS: 1,819,557,888 bytes
deleting table
RSS: 1,819,590,656 bytes
releasing unused memory
RSS: 1,819,852,800 bytes
waiting 10 seconds
RSS: 1,819,852,800 bytes
memory_pool=jemalloc
RSS: 45,629,440 bytes
reading table
RSS: 1,668,677,632 bytes
deleting table
RSS: 698,400,768 bytes
releasing unused memory
RSS: 699,023,360 bytes
waiting 10 seconds
RSS: 699,023,360 bytes
memory_pool=system
RSS: 44,875,776 bytes
reading table
RSS: 1,713,569,792 bytes
deleting table
RSS: 540,311,552 bytes
releasing unused memory
RSS: 540,311,552 bytes
waiting 10 seconds
RSS: 540,311,552 bytes
{code}
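One possible way to narrow this down (a sketch, not part of the original script) is to print the pool's own accounting next to RSS. {{bytes_allocated()}} and {{backend_name}} are existing {{pyarrow.MemoryPool}} members; the helper names {{allocator_overhead}} and {{report}}, and the idea of subtracting a baseline RSS, are assumptions made for illustration. If {{bytes_allocated()}} falls to roughly zero after the {{del}} while RSS stays high, the memory is presumably being held by the allocator (or by non-Arrow allocations) and not by live Arrow buffers.

```python
def allocator_overhead(rss_bytes, baseline_rss_bytes, pool_bytes_allocated):
    """RSS growth since the baseline that live Arrow allocations do not explain.

    Hypothetical helper: a large positive value suggests memory retained by
    the allocator (or by non-Arrow allocations), not by live Arrow buffers.
    """
    return (rss_bytes - baseline_rss_bytes) - pool_bytes_allocated


def report(label, baseline_rss_bytes):
    """Print RSS next to the default memory pool's own accounting."""
    # Imported here so allocator_overhead() stays usable without these packages.
    import os

    import psutil
    import pyarrow as pa

    pool = pa.default_memory_pool()
    rss = psutil.Process(os.getpid()).memory_info().rss
    allocated = pool.bytes_allocated()
    overhead = allocator_overhead(rss, baseline_rss_bytes, allocated)
    print(
        f"{label}: backend={pool.backend_name} RSS={rss:,} "
        f"pool_allocated={allocated:,} unexplained={overhead:,}"
    )
    return overhead
```

In the script above, {{report("after del", baseline)}} could be called wherever {{print_rss()}} is called, with {{baseline}} taken from {{process.memory_info().rss}} before the table is built.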