Posted to jira@arrow.apache.org by "Norbert (Jira)" <ji...@apache.org> on 2022/10/25 13:53:00 UTC

[jira] [Created] (ARROW-18156) [Python/C++] High memory usage/potential leak when reading parquet using Dataset API

Norbert created ARROW-18156:
-------------------------------

             Summary: [Python/C++] High memory usage/potential leak when reading parquet using Dataset API
                 Key: ARROW-18156
                 URL: https://issues.apache.org/jira/browse/ARROW-18156
             Project: Apache Arrow
          Issue Type: Bug
          Components: Parquet
    Affects Versions: 4.0.1
            Reporter: Norbert


Hi,

I have a 2.35 GB DataFrame (1.17 GB on-disk size) which I'm loading using the following snippet:

 
{code:java}
import os
import pyarrow
import pyarrow.dataset as ds
from importlib_metadata import version
from psutil import Process
import pyarrow.parquet as pq

def format_bytes(num_bytes: int):
    return f"{num_bytes / 1024 / 1024 / 1024:.2f} GB"
 
def main():
    print(version("pyarrow"))
    print(pyarrow.default_memory_pool().backend_name)
    process = Process(os.getpid())
    runs = 10
    print(f"Runs: {runs}")
    for i in range(runs):
        # Load the parquet file via the Dataset API and convert to pandas.
        dataset = ds.dataset("df.pq")
        table = dataset.to_table()
        df = table.to_pandas()
        print(f"After run {i}: RSS = {format_bytes(process.memory_info().rss)}, "
              f"PyArrow Allocated Bytes = {format_bytes(pyarrow.total_allocated_bytes())}")

if __name__ == "__main__":
    main()
{code}
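
For reproducibility, a file of roughly this size can be generated with something like the sketch below. The actual DataFrame's schema shouldn't matter here; the columns are an illustrative assumption, and the on-disk size will differ depending on compression.
{code:java}
# Hypothetical data generator; any parquet file of comparable size
# should reproduce the behaviour.
import numpy as np
import pandas as pd

n_rows = 100_000_000  # ~2.4 GB in memory for three 8-byte columns
df = pd.DataFrame({
    "a": np.random.rand(n_rows),
    "b": np.random.rand(n_rows),
    "c": np.random.randint(0, 1_000_000, n_rows),
})
df.to_parquet("df.pq")  # on-disk size depends on encoding/compression
{code}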
On PyArrow v4.0.1 the output is as follows:
{code:java}
4.0.1
system
Runs: 10
After run 0: RSS = 7.59 GB, PyArrow Allocated Bytes = 6.09 GB
After run 1: RSS = 13.36 GB, PyArrow Allocated Bytes = 6.09 GB
After run 2: RSS = 14.74 GB, PyArrow Allocated Bytes = 6.09 GB
After run 3: RSS = 15.78 GB, PyArrow Allocated Bytes = 6.09 GB
After run 4: RSS = 18.36 GB, PyArrow Allocated Bytes = 6.09 GB
After run 5: RSS = 19.69 GB, PyArrow Allocated Bytes = 6.09 GB
After run 6: RSS = 21.21 GB, PyArrow Allocated Bytes = 6.09 GB
After run 7: RSS = 21.52 GB, PyArrow Allocated Bytes = 6.09 GB
After run 8: RSS = 21.49 GB, PyArrow Allocated Bytes = 6.09 GB
After run 9: RSS = 21.72 GB, PyArrow Allocated Bytes = 6.09 GB
After run 10: RSS = 20.95 GB, PyArrow Allocated Bytes = 6.09 GB{code}

If I replace ds.dataset("df.pq").to_table() with pq.ParquetFile("df.pq").read(), the output is:
{code:java}
4.0.1
system
Runs: 10
After run 0: RSS = 2.38 GB, PyArrow Allocated Bytes = 1.34 GB
After run 1: RSS = 2.49 GB, PyArrow Allocated Bytes = 1.34 GB
After run 2: RSS = 2.50 GB, PyArrow Allocated Bytes = 1.34 GB
After run 3: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
After run 4: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
After run 5: RSS = 2.56 GB, PyArrow Allocated Bytes = 1.34 GB
After run 6: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
After run 7: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB
After run 8: RSS = 2.48 GB, PyArrow Allocated Bytes = 1.34 GB
After run 9: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB
After run 10: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB{code}
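
For clarity, the only change between the two measurements is the loop body (a sketch; both modules are imported in the snippet above):
{code:java}
# Dataset API path (RSS grows across runs):
table = ds.dataset("df.pq").to_table()

# Plain ParquetFile path (RSS stays flat):
table = pq.ParquetFile("df.pq").read()
{code}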

The memory usage profile of the older, non-dataset API is much lower and tracks the size of the DataFrame much more closely. In the Dataset API case it also looks like there is a memory leak: RSS keeps growing across runs while PyArrow's allocated bytes stay constant at 6.09 GB. I initially assumed the RSS growth was just down to PyArrow's use of jemalloc, but as the output shows, I am using the system allocator here.
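
For reference, the allocator backing the default memory pool can be inspected and switched as in the sketch below. Note that release_unused() is only exposed in newer PyArrow versions, and the jemalloc/mimalloc pools are only available if PyArrow was built with them:
{code:java}
import pyarrow as pa

# Which allocator backs the default memory pool ("system" in my runs above):
print(pa.default_memory_pool().backend_name)

# Force the system allocator explicitly (can also be selected with the
# ARROW_DEFAULT_MEMORY_POOL environment variable before pyarrow is imported):
pa.set_memory_pool(pa.system_memory_pool())

# On newer PyArrow versions, ask the pool to hand free memory back to the OS:
pa.default_memory_pool().release_unused()
{code}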
