Posted to jira@arrow.apache.org by "Jonathan Keane (Jira)" <ji...@apache.org> on 2021/04/23 13:34:00 UTC

[jira] [Commented] (ARROW-12519) [C++] Create/document better characterization of jemalloc/mimalloc

    [ https://issues.apache.org/jira/browse/ARROW-12519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330466#comment-17330466 ] 

Jonathan Keane commented on ARROW-12519:
----------------------------------------

Interesting! I've definitely noticed this with the fanniemae csv but never uncovered what was going on with it.

If you look at the fanniemae columns of plots, and especially the arrow_table rows (where we only read into Arrow and do not then convert to a dataframe), you can see that the durations increase over time, which IME in situations like these correlates with memory (over)allocation / leakiness / what you describe.

https://ursalabs.org/blog/2021-r-benchmarks-part-1/memory-allocators-full.png

Interestingly, this seems to happen with all datasets under jemalloc (there's some exploration of this, though no real conclusion, in ARROW-11433).
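To make that comparison concrete, here is a minimal sketch of the kind of per-allocator measurement I mean. It is not from the original report: the file path is a hypothetical local copy of the fanniemae parquet used in the script below, and it assumes pyarrow was built with the jemalloc and mimalloc backends (the pool getters raise NotImplementedError otherwise).

{code:python}
import os
import sys
import time

import psutil
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical local copy of the benchmark file referenced in the issue below.
PATH = 'fanniemae_2016Q4.uncompressed.parquet'

# Pick the allocator on the command line so each backend gets a clean process
# (otherwise RSS is polluted by whatever the previous allocator retained).
backend = sys.argv[1] if len(sys.argv) > 1 else 'mimalloc'
pools = {'jemalloc': pa.jemalloc_memory_pool,
         'mimalloc': pa.mimalloc_memory_pool,
         'system': pa.system_memory_pool}
pa.set_memory_pool(pools[backend]())

proc = psutil.Process(os.getpid())
for i in range(10):
    start = time.time()
    table = pq.read_table(PATH)
    df = table.to_pandas()
    # Duration and RSS should both stay roughly flat if the allocator behaves.
    print(backend, i, round(time.time() - start, 2), proc.memory_info().rss)
{code}

Running that once per backend and plotting per-iteration duration and RSS would show whether the creep is specific to jemalloc or just easier to see there.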

> [C++] Create/document better characterization of jemalloc/mimalloc
> ------------------------------------------------------------------
>
>                 Key: ARROW-12519
>                 URL: https://issues.apache.org/jira/browse/ARROW-12519
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>
> The following script reads a large dataset 10 times in a loop.  The dataset referred to is from the Ursa benchmarks ([https://github.com/ursacomputing/benchmarks]), though any sufficiently large dataset should do.  This one is ~5-6 GB when deserialized into an Arrow table.  The conversion to a dataframe is not zero-copy, so the loop requires about 8.6 GB.
> Running this code 10 times with mimalloc consumes 27 GB of RAM.  It is pretty deterministic; even putting a 1 second sleep in between each run yields the same result.  On the other hand, if I put the read into its own method (second version below), it uses only 14 GB.
> Our current rule of thumb seems to be "as long as the allocators stabilize to some number at some point then it is not a bug", so technically both 27 GB and 14 GB are valid.
> If we can't put any kind of bound whatsoever on the RAM that Arrow needs, it will eventually become a problem for adoption.  I think we need to develop some characterization of how much mimalloc/jemalloc should be allowed to over-allocate before we consider it a bug and require changing the code to avoid the situation (or documenting that certain operations are not valid).
>  
> ----CODE----
>  
> // First version (uses ~27GB)
> {code:python}
> import time
> import pyarrow as pa
> import pyarrow.parquet as pq
> import psutil
> import os
> pa.set_memory_pool(pa.mimalloc_memory_pool())
> print(pa.default_memory_pool().backend_name)
> for _ in range(10):
>     table = pq.read_table('/home/pace/dev/benchmarks/benchmarks/data/temp/fanniemae_2016Q4.uncompressed.parquet')
>     df = table.to_pandas()
>     print(pa.total_allocated_bytes())
>     proc = psutil.Process(os.getpid())
>     print(proc.memory_info())
> {code}
> // Second version (uses ~14GB)
> {code:python}
> import time
> import pyarrow as pa
> import pyarrow.parquet as pq
> import psutil
> import os
> pa.set_memory_pool(pa.mimalloc_memory_pool())
> print(pa.default_memory_pool().backend_name)
> def bm():
>     table = pq.read_table('/home/pace/dev/benchmarks/benchmarks/data/temp/fanniemae_2016Q4.uncompressed.parquet')
>     df = table.to_pandas()
>     print(pa.total_allocated_bytes())
>     proc = psutil.Process(os.getpid())
>     print(proc.memory_info())
> for _ in range(10):
>     bm()
> {code}
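
As a possible starting point for the characterization asked for above, one rough metric could be the gap between process RSS and what Arrow itself reports as live/peak allocations after each iteration. A minimal sketch, not from the original report: it assumes the same hypothetical local file path, and only calls MemoryPool.release_unused() if the installed pyarrow exposes it.

{code:python}
import os

import psutil
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical local copy of the benchmark file from the script above.
PATH = 'fanniemae_2016Q4.uncompressed.parquet'

pool = pa.default_memory_pool()
proc = psutil.Process(os.getpid())

for i in range(10):
    table = pq.read_table(PATH)
    df = table.to_pandas()
    del table, df
    # Ask the allocator to return unused pages to the OS, if the binding has it.
    if hasattr(pool, 'release_unused'):
        pool.release_unused()
    rss = proc.memory_info().rss
    live = pa.total_allocated_bytes()   # bytes Arrow still considers allocated
    peak = pool.max_memory()            # high-water mark of the pool
    overhead = rss - live               # what the process holds beyond live data
    print(i, rss >> 20, live >> 20, peak >> 20, overhead >> 20)  # all in MiB
{code}

If the overhead keeps growing across iterations while live allocations stay flat, that would be a number we could put a documented bound on (or treat as a bug once it is exceeded).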



--
This message was sent by Atlassian Jira
(v8.3.4#803005)