Posted to jira@arrow.apache.org by "Ziheng Wang (Jira)" <ji...@apache.org> on 2022/03/27 03:33:00 UTC

[jira] [Created] (ARROW-16037) Possible memory leak in compute.take

Ziheng Wang created ARROW-16037:
-----------------------------------

             Summary: Possible memory leak in compute.take
                 Key: ARROW-16037
                 URL: https://issues.apache.org/jira/browse/ARROW-16037
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 6.0.1
         Environment: Ubuntu
            Reporter: Ziheng Wang


If you run the following code, the resident memory (RSS) of the process climbs to about 1 GB even though pa.total_allocated_bytes() stays at ~80 MB the whole time. The process memory eventually drops back to around 800 MB, which is still far more than the table actually requires.

{code:python}
import os
import gc

import numpy as np
import pandas as pd
import psutil
import pyarrow as pa
import pyarrow.compute as compute

# ~80 MB table: 10000 rows x 1000 float64 columns (10000 * 1000 * 8 bytes)
my_table = pa.Table.from_pandas(pd.DataFrame(np.random.normal(size=(10000, 1000))))

process = psutil.Process(os.getpid())
print("mem usage", process.memory_info().rss, pa.total_allocated_bytes())

for i in range(100):
    print("mem usage", process.memory_info().rss, pa.total_allocated_bytes())
    # Sort on the first column, then reorder the whole table by those indices
    temp = compute.sort_indices(my_table['0'], sort_keys=[('0', 'ascending')])
    my_table = my_table.take(temp)
    gc.collect()
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)