Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/05/18 08:00:15 UTC

[jira] [Commented] (ARROW-8801) [Python] Memory leak on read from parquet file with UTC timestamps using pandas

    [ https://issues.apache.org/jira/browse/ARROW-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110005#comment-17110005 ] 

Joris Van den Bossche commented on ARROW-8801:
----------------------------------------------

From a quick test, this is not related to Parquet itself (reading the file with pyarrow instead of with pandas doesn't show the issue), but rather to the pyarrow-to-pandas conversion: a loop doing just {{table.to_pandas()}} does show a memory increase.

{code}
import numpy as np
import pandas as pd
import pyarrow as pa

# One tz-aware (UTC) timestamp column, ~1M rows
x = pd.to_datetime(np.random.randint(0, 2**32, size=2**20), unit='ms', utc=True)
table = pa.table(pd.DataFrame({'x': x}))

# Memory usage grows on each conversion back to pandas
for _ in range(2**8):
    table.to_pandas()
{code}
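
One way to watch it happen (a rough sketch, not part of the original report: it assumes {{psutil}} is installed and that data.parquet produced by the dump.py script quoted below exists). Memory stays roughly flat when only {{pq.read_table}} runs, but grows once {{.to_pandas()}} is added:

{code}
import os

import psutil
import pyarrow.parquet as pq

proc = psutil.Process(os.getpid())

def rss_mb():
    # Resident set size of this process, in megabytes
    return proc.memory_info().rss / 2**20

# Reading the file with pyarrow alone: memory stays roughly flat
for _ in range(2**8):
    pq.read_table('data.parquet')
print(f'after read_table only: {rss_mb():.0f} MB')

# Adding the to_pandas() conversion: memory grows steadily
for _ in range(2**8):
    pq.read_table('data.parquet').to_pandas()
print(f'after read_table + to_pandas: {rss_mb():.0f} MB')
{code}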

Did you get the same issue on pyarrow 0.16 (since you mentioned you tested that as well), or is it only present since 0.17?

> [Python] Memory leak on read from parquet file with UTC timestamps using pandas
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-8801
>                 URL: https://issues.apache.org/jira/browse/ARROW-8801
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0, 0.17.0
>         Environment: Tested using pyarrow 0.17.0, pandas 1.0.3, python 3.7.5, mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3, python 3.8.2, ubuntu 20.04 (linux).
>            Reporter: Rauli Ruohonen
>            Priority: Major
>             Fix For: 1.0.0
>
>
> Given the dump.py script
>
> {code:python}
> import pandas as pd
> import numpy as np
> x = pd.to_datetime(np.random.randint(0, 2**32, size=2**20), unit='ms', utc=True)
> pd.DataFrame({'x': x}).to_parquet('data.parquet', engine='pyarrow', compression=None)
> {code}
> and the load.py script
>
> {code:python}
> import sys
> import pandas as pd
> def foo(engine):
>     for _ in range(2**9):
>         pd.read_parquet('data.parquet', engine=engine)
>     print('Done')
>     input()
> foo(sys.argv[1])
> {code}
> running first "python dump.py" and then "python load.py pyarrow", Python memory usage on my machine stays at 4+ GB while the script waits for input. Using "python load.py fastparquet" instead, it stays at about 100 MB, so this appears to be a pyarrow issue rather than a pandas issue. The leak disappears if "utc=True" is removed from dump.py, in which case the timestamps are timezone-naive.
>
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)