Posted to jira@arrow.apache.org by "Norbert (Jira)" <ji...@apache.org> on 2022/10/25 18:11:00 UTC

[jira] [Comment Edited] (ARROW-18156) [Python/C++] High memory usage/potential leak when reading parquet using Dataset API

    [ https://issues.apache.org/jira/browse/ARROW-18156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623966#comment-17623966 ] 

Norbert edited comment on ARROW-18156 at 10/25/22 6:10 PM:
-----------------------------------------------------------

[~jorisvandenbossche] Thanks for the fast response. Here is a snippet that generates a representative DataFrame of a similar size with the same dtypes:

 
{code:python}
import numpy as np
import pandas as pd


def format_bytes(num_bytes: int):
    return f"{num_bytes / 1024 / 1024 / 1024:.2f} GB"


# ~2 years of dates x 5000 identifiers x 48 intraday times = 175,200,000 rows.
dates = pd.date_range("2020-10-01", "2022-09-30")
identifiers = list(range(5000))
times = []
for hour in range(9, 17):
    for minute in range(6):
        times.append(f"{hour:02d}:{minute * 10:02d}:00")

dates = pd.Series(dates, dtype="datetime64[ns]")
identifiers = pd.Series(identifiers, dtype="int64")
times = pd.Series(times, dtype="category")

index = pd.MultiIndex.from_product(
    [dates, identifiers, times], names=["date", "identifier", "time"]
)
df = pd.DataFrame({"value": np.random.rand(len(index))}, index=index)
print(df)
print(df.dtypes)
print(df.index.dtypes)
print(format_bytes(df.memory_usage(deep=True).sum()))
df.to_parquet("df.pq")
{code}
Output:

 
{code:java}
                                   value
date       identifier time
2020-10-01 0          09:00:00  0.697832
                      09:10:00  0.162727
                      09:20:00  0.902748
                      09:30:00  0.925639
                      09:40:00  0.852034
...                                  ...
2022-09-30 4999       16:10:00  0.420191
                      16:20:00  0.223169
                      16:30:00  0.838116
                      16:40:00  0.795141
                      16:50:00  0.882058
[175200000 rows x 1 columns]
value    float64
dtype: object
date          datetime64[ns]
identifier             int64
time                category
dtype: object
2.12 GB
{code}
I can't test on 9.0.0 at the moment - would you be able to run it for me?
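
In case it saves time for whoever can test on 9.0.0, here is a condensed, self-contained version of the read-back loop from the issue description below. It assumes the df.pq file written by the snippet above and reuses the same format_bytes helper; I have not run it on 9.0.0 myself:

{code:python}
import os

import pyarrow
import pyarrow.dataset as ds
import pyarrow.parquet as pq
from psutil import Process


def format_bytes(num_bytes: int):
    return f"{num_bytes / 1024 / 1024 / 1024:.2f} GB"


process = Process(os.getpid())
print(pyarrow.__version__, pyarrow.default_memory_pool().backend_name)

for i in range(10):
    # Dataset API read path; swap in pq.ParquetFile("df.pq").read() to compare.
    table = ds.dataset("df.pq").to_table()
    # Keep the pandas conversion in the loop, as in the original report.
    df = table.to_pandas()
    print(
        f"After run {i}: RSS = {format_bytes(process.memory_info().rss)}, "
        f"PyArrow Allocated Bytes = {format_bytes(pyarrow.total_allocated_bytes())}"
    )
{code}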

 


> [Python/C++] High memory usage/potential leak when reading parquet using Dataset API
> ------------------------------------------------------------------------------------
>
>                 Key: ARROW-18156
>                 URL: https://issues.apache.org/jira/browse/ARROW-18156
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet
>    Affects Versions: 4.0.1
>            Reporter: Norbert
>            Priority: Major
>
> Hi,
> I have a 2.35 GB DataFrame (1.17 GB on-disk size) which I'm loading using the following snippet:
>  
> {code:python}
> import os
> 
> import pyarrow
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
> from importlib_metadata import version
> from psutil import Process
> 
> 
> def format_bytes(num_bytes: int):
>     return f"{num_bytes / 1024 / 1024 / 1024:.2f} GB"
> 
> 
> def main():
>     print(version("pyarrow"))
>     print(pyarrow.default_memory_pool().backend_name)
>     process = Process(os.getpid())
>     runs = 10
>     print(f"Runs: {runs}")
>     # Read the same parquet file repeatedly and compare RSS against Arrow-allocated bytes.
>     for i in range(runs):
>         dataset = ds.dataset("df.pq")
>         table = dataset.to_table()
>         df = table.to_pandas()
>         print(f"After run {i}: RSS = {format_bytes(process.memory_info().rss)}, PyArrow Allocated Bytes = {format_bytes(pyarrow.total_allocated_bytes())}")
> 
> 
> if __name__ == "__main__":
>     main()
> {code}
>  
>  
> On PyArrow v4.0.1 the output is as follows:
> {code:java}
> 4.0.1
> system
> Runs: 10
> After run 0: RSS = 7.59 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 1: RSS = 13.36 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 2: RSS = 14.74 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 3: RSS = 15.78 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 4: RSS = 18.36 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 5: RSS = 19.69 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 6: RSS = 21.21 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 7: RSS = 21.52 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 8: RSS = 21.49 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 9: RSS = 21.72 GB, PyArrow Allocated Bytes = 6.09 GB
> After run 10: RSS = 20.95 GB, PyArrow Allocated Bytes = 6.09 GB
> {code}
> If I replace ds.dataset("df.pq").to_table() with pq.ParquetFile("df.pq").read(), the output is:
> {code:java}
> 4.0.1
> system
> Runs: 10
> After run 0: RSS = 2.38 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 1: RSS = 2.49 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 2: RSS = 2.50 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 3: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 4: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 5: RSS = 2.56 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 6: RSS = 2.53 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 7: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 8: RSS = 2.48 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 9: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB
> After run 10: RSS = 2.51 GB, PyArrow Allocated Bytes = 1.34 GB
> {code}
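> For reference, the only change between the two sets of runs above is the read call inside the loop; roughly (same df.pq file as above):
> {code:python}
> import pyarrow.dataset as ds
> import pyarrow.parquet as pq
> 
> # Dataset API path (RSS keeps growing in the first set of runs):
> table = ds.dataset("df.pq").to_table()
> 
> # Older non-dataset path (flat memory profile in the second set of runs):
> table = pq.ParquetFile("df.pq").read()
> {code}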
> The memory profile of the older, non-dataset API is much lower and tracks the size of the DataFrame much more closely. It also looks like the Dataset API example may be leaking memory. I thought the growing RSS was just due to PyArrow's use of jemalloc, but I appear to be using the system allocator here.
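> A minimal sketch of how the allocator side could be inspected or varied (assuming a recent pyarrow; MemoryPool.release_unused() may not be available on 4.0.1):
> {code:python}
> import pyarrow as pa
> 
> # Which allocator backs the default pool ("system", "jemalloc" or "mimalloc")?
> pool = pa.default_memory_pool()
> print(pool.backend_name, pa.total_allocated_bytes())
> 
> # Switch the default pool explicitly (can also be done via the
> # ARROW_DEFAULT_MEMORY_POOL environment variable before importing pyarrow).
> pa.set_memory_pool(pa.system_memory_pool())
> 
> # Ask the pool to hand unused memory back to the OS between runs
> # (only on newer pyarrow versions).
> pa.default_memory_pool().release_unused()
> {code}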
>  


