Posted to user@arrow.apache.org by Yp Xie <pe...@gmail.com> on 2022/02/21 14:52:57 UTC

Re: [Python] weird memory usage issue when reading a parquet file.

Thanks, Wes and Weston, for your explanations.

I just tried waiting 5s after the to_table call, and indeed the memory
usage reported by psutil decreased to a reasonable size.
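
For reference, the change was just a sleep between the read and the
measurement (a minimal sketch based on the script quoted at the bottom of
this thread; time.sleep is the only addition):

import time

table = dataset.to_table(filter=filter, columns=projection)
time.sleep(5)  # give jemalloc time to return freed memory to the OS
show_mem("dataset.to_table")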

thanks again.

- xyp

Weston Pace <we...@gmail.com> wrote on Mon, Jan 3, 2022 at 23:02:

> Wes' theory seems sound.  Perhaps the easiest way to test it would be to
> put a five-second sleep after the to_table call and before you run
> show_mem.  In theory 1s is long enough, but 5s is nice to remove any
> doubt.
>
> If there is a filter that cannot be serviced by Parquet row group
> statistics, there will be more total allocation.  This is because we first
> need to read in the full row group and then filter it, which is a copy
> operation into a (hopefully) smaller row group.
>
> The filtering should happen after the column pruning, but if the filter
> references any columns that are not included in the final result, then we
> will need to load those additional columns, use them for the filter, and
> then drop them.  This is another way you might end up with more total
> allocation if you use a filter.
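>
> That is the situation in the script below: the filter column
> ("locationID") is not part of the projection.  A rough sketch of the two
> variants (column names taken from that script):
>
> import pyarrow.dataset as ds
>
> dataset = ds.dataset("tmp/uber.parquet", format="parquet")
>
> # Filter column NOT in the projection: "locationID" is loaded, used for
> # the filter, then dropped (extra temporary allocation).
> t1 = dataset.to_table(
>     columns={"Dispatching_base_num": ds.field("Dispatching_base_num")},
>     filter=ds.field("locationID") == 100,
> )
>
> # Filter column IS in the projection: no throwaway load, but the column
> # remains in the result table.
> t2 = dataset.to_table(
>     columns=["Dispatching_base_num", "locationID"],
>     filter=ds.field("locationID") == 100,
> )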
>
> -Weston
>
> On Mon, Jan 3, 2022 at 3:10 AM Wes McKinney <we...@gmail.com> wrote:
>
>> By default we use jemalloc as our memory allocator, which empirically has
>> been seen to yield better application performance. jemalloc does not
>> release memory to the operating system right away; this can be altered by
>> using a different default allocator (for example, the system allocator may
>> return memory to the OS right away):
>>
>>
>> https://arrow.apache.org/docs/cpp/memory.html#overriding-the-default-memory-pool
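>>
>> As a short sketch of those knobs from Python (the decay setting only
>> applies while jemalloc is the active allocator):
>>
>> import pyarrow as pa
>>
>> # Switch all new allocations to the system allocator:
>> pa.set_memory_pool(pa.system_memory_pool())
>>
>> # Or keep jemalloc but ask it to release dirty pages immediately:
>> pa.jemalloc_set_decay_ms(0)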
>>
>> I expect that the reason psutil-reported allocated memory is higher in
>> the last case is that some temporary allocations made during the
>> filtering process are raising the "high water mark". I believe you can
>> see what is reported as the peak memory allocation by looking at
>> pyarrow.default_memory_pool().max_memory()
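>>
>> For example (a sketch; max_memory() reports the peak bytes ever
>> allocated from the pool, not the current usage):
>>
>> import pyarrow as pa
>>
>> pool = pa.default_memory_pool()
>> # ... run dataset.to_table(...) here ...
>> print(f"current: {pool.bytes_allocated() >> 20}M")
>> print(f"peak:    {pool.max_memory() >> 20}M")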
>>
>> On Mon, Dec 20, 2021 at 5:10 AM Yp Xie <pe...@gmail.com> wrote:
>>
>>> Hi guys,
>>>
>>> I'm seeing some weird memory usage numbers when reading a parquet file
>>> with pyarrow.
>>>
>>> I wrote a simple script to show how much memory is consumed after each
>>> step.
>>> The results are shown in the table below:
>>>
>>> scenario                                    row number  pa.total_allocated_bytes  memory usage by psutil
>>> without filters                             5131100     177M                      323M
>>> with field filter                           57340       2041K                     323M
>>> with column pruning                         5131100     48M                       154M
>>> with both field filter and column pruning   57340       567K                      204M
>>>
>>> The weird part: the total memory usage when I apply both the field filter
>>> and column pruning is *larger* than with column pruning alone.
>>>
>>> I don't know how that happened. Do you guys know the reason for this?
>>>
>>> thanks.
>>>
>>> env info:
>>>
>>> platform: Linux-5.4.0-91-generic-x86_64-with-glibc2.10
>>> distro info: ('Ubuntu', '20.04', 'focal')
>>> pyarrow: 6.0.1
>>>
>>>
>>> script code:
>>>
>>> import pyarrow as pa
>>> import psutil
>>> import os
>>> import pyarrow.dataset as ds
>>>
>>> pid = os.getpid()
>>>
>>> def show_mem(action: str) -> None:
>>>     # Resident set size (RSS) of this process, in MiB.
>>>     mem = psutil.Process(pid).memory_info().rss >> 20
>>>     print(f"******* memory usage after {action} **********")
>>>     print(f"*                   {mem}M                    *")
>>>     print("**********************************************")
>>>
>>> dataset = ds.dataset("tmp/uber.parquet", format="parquet")
>>> show_mem("read dataset")
>>> projection = {
>>>     "Dispatching_base_num": ds.field("Dispatching_base_num")
>>> }
>>> filter = ds.field("locationID") == 100
>>> table = dataset.to_table(
>>>     filter=filter,
>>>     columns=projection
>>>     )
>>> print(f"table row number: {table.num_rows}")
>>> print(f"total bytes: {pa.total_allocated_bytes() >> 10}K")
>>> show_mem("dataset.to_table")
>>>
>>