Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2021/11/22 12:25:00 UTC

[jira] [Comment Edited] (ARROW-14790) Memory leak when reading CSV files

    [ https://issues.apache.org/jira/browse/ARROW-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447365#comment-17447365 ] 

Antoine Pitrou edited comment on ARROW-14790 at 11/22/21, 12:24 PM:
--------------------------------------------------------------------

Two things:

1) {{sys.getrefcount(object)}} in Python will not tell you anything actually useful

2) This does not look like a memory leak, since memory consumption seems to reach a fixed point. Modern memory allocators are complex, and they don't necessarily return memory _to the system_ when it is freed, because doing so can be costly. Instead, they use heuristics and keep some freed memory as a cache for future allocations.
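
To make the "fixed point" observation concrete, here is a small sketch that computes the growth between consecutive readings, using the Ruby figures quoted in the issue description below. The helper name {{growth_per_iteration}} is made up for illustration; only the numbers come from the report.

```python
# Memory readings (MB) printed by the Ruby script in the report:
# the first value is the baseline, the others follow each CSV load.
ruby_mb = [
    34.234375, 347.625, 438.7890625, 457.6953125, 469.8046875,
    480.88671875, 487.96484375, 493.8359375, 497.671875,
    498.55859375, 501.42578125,
]


def growth_per_iteration(readings):
    """Return the increase between consecutive readings (hypothetical helper)."""
    return [later - earlier for earlier, later in zip(readings, readings[1:])]


deltas = growth_per_iteration(ruby_mb)
for i, delta in enumerate(deltas, start=1):
    print(f"iteration {i}: +{delta:.2f} MB")
```

A steady leak would typically add a roughly constant amount on every load. Here the first load adds about 313 MB, while the later ones add only a few MB each, which is the pattern you would expect from an allocator filling up its cache.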

PyArrow exposes an API to try to return released memory to the system. It is best-effort, since it depends on how the underlying allocator (e.g. jemalloc) works:
{code:python}
>>> import pyarrow as pa
>>> pool = pa.default_memory_pool()
>>> pool.release_unused()
{code}

You may try that in your script. I don't know if we expose the same API in Ruby. [~kou]
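
For the Python script, one way to check whether Arrow itself still holds memory is to look at the pool's own accounting after calling {{release_unused()}}. The following is only a sketch: {{release_and_describe}} is a hypothetical helper, not a PyArrow function, and it assumes the pool object exposes {{backend_name}}, {{bytes_allocated()}} and {{release_unused()}} as {{pyarrow.MemoryPool}} does.

```python
try:
    import pyarrow as pa
except ImportError:  # lets the helper be read and exercised without PyArrow
    pa = None


def release_and_describe(pool):
    """Ask the pool to return unused memory, then report its accounting.

    Hypothetical helper; `pool` is expected to behave like pyarrow.MemoryPool.
    """
    pool.release_unused()  # best-effort; the effect depends on the allocator
    return {
        "backend": pool.backend_name,               # name of the allocator backend
        "bytes_allocated": pool.bytes_allocated(),  # bytes Arrow still holds live
    }


if pa is not None:
    print(release_and_describe(pa.default_memory_pool()))
```

If {{bytes_allocated}} is close to zero once the tables have been dropped while the process size stays high, the remaining memory is most likely being kept by the allocator rather than leaked by Arrow.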



> Memory leak when reading CSV files
> ----------------------------------
>
>                 Key: ARROW-14790
>                 URL: https://issues.apache.org/jira/browse/ARROW-14790
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Sten Larsson
>            Priority: Major
>
> We're having a problem with a memory leak in a Ruby script that processes many CSV files. I have written some short scripts to demonstrate the problem: [https://gist.github.com/stenlarsson/60b1e4e99416738b41ee30e7ba294214]
> The first script, [arrow_test_csv.rb|https://gist.github.com/stenlarsson/60b1e4e99416738b41ee30e7ba294214#file-arrow_test_csv-rb], creates a 184 MB CSV file for testing.
> The second script, [arrow_memory_leak.rb|https://gist.github.com/stenlarsson/60b1e4e99416738b41ee30e7ba294214#file-arrow_memory_leak-rb], then loads the CSV file 10 times using Arrow. It uses the [get_process_mem|https://rubygems.org/gems/get_process_mem] gem to print the memory usage both before and after each iteration. It also invokes the garbage collector on each iteration to ensure the problem is not that Ruby holds on to any objects. This is what it prints on my MacBook Pro using Arrow 6.0.0:
> {noformat}
> 127577 objects, 34.234375 MB
> 127577 objects, 347.625 MB
> 127577 objects, 438.7890625 MB
> 127577 objects, 457.6953125 MB
> 127577 objects, 469.8046875 MB
> 127577 objects, 480.88671875 MB
> 127577 objects, 487.96484375 MB
> 127577 objects, 493.8359375 MB
> 127577 objects, 497.671875 MB
> 127577 objects, 498.55859375 MB
> 127577 objects, 501.42578125 MB
> {noformat}
> The third script, [arrow_memory_leak.py|https://gist.github.com/stenlarsson/60b1e4e99416738b41ee30e7ba294214#file-arrow_memory_leak-py], is a Python implementation of the same script. This shows that the problem is not in the Ruby bindings:
> {noformat}
> 2106 objects, 31.75390625 MB
> 2106 objects, 382.28515625 MB
> 2106 objects, 549.41796875 MB
> 2106 objects, 656.78125 MB
> 2106 objects, 679.6875 MB
> 2106 objects, 691.9921875 MB
> 2106 objects, 708.73828125 MB
> 2106 objects, 717.296875 MB
> 2106 objects, 724.390625 MB
> 2106 objects, 729.19921875 MB
> 2106 objects, 734.47265625 MB
> {noformat}
> I have also tested Arrow 5.0.0 and it has the same problem.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)