Posted to dev@arrow.apache.org by "James Porritt (JIRA)" <ji...@apache.org> on 2017/05/12 17:13:04 UTC

[jira] [Created] (ARROW-1017) Python: Calling to_pandas on a Parquet file in HDFS leaks memory

James Porritt created ARROW-1017:
------------------------------------

             Summary: Python: Calling to_pandas on a Parquet file in HDFS leaks memory
                 Key: ARROW-1017
                 URL: https://issues.apache.org/jira/browse/ARROW-1017
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.3.0
            Reporter: James Porritt


Running the following code results in ever-increasing memory usage, even though I would expect the dataframe to be garbage collected when it goes out of scope at the end of each call. For the size of my Parquet file, I see usage increase by about 3 GB per loop iteration:

{code}
from pyarrow import HdfsClient

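def read_parquet_file(client, parquet_file):
    # Read the Parquet file from HDFS.
    parquet = client.read_parquet(parquet_file)
    # Convert to pandas; df is local, so it should be freed on return.
    df = parquet.to_pandas()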

client = HdfsClient("hdfshost", 8020, "myuser", driver='libhdfs3')
parquet_file = '/my/parquet/file'
while True:
    read_parquet_file(client, parquet_file)
{code}

Is there a reference count issue similar to ARROW-362?
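
One way to quantify the growth per iteration is sketched below. It is a minimal illustration, not part of the original reproduction: the helper names `current_rss_bytes` and `rss_growth_per_call` are hypothetical, and the default measurement reads `/proc/self/statm`, so it assumes Linux. Forcing `gc.collect()` after each call helps separate memory that Python's garbage collector can reclaim from memory that stays allocated regardless.

```python
import gc
import os


def current_rss_bytes():
    """Return this process's resident set size in bytes (Linux only)."""
    with open("/proc/self/statm") as f:
        # Second field of statm is the number of resident pages.
        resident_pages = int(f.read().split()[1])
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def rss_growth_per_call(func, iterations=5, measure=current_rss_bytes):
    """Call func repeatedly and return the memory delta after each call.

    A full garbage collection runs before every measurement, so growth
    that remains is not explained by uncollected Python objects alone.
    """
    deltas = []
    gc.collect()
    before = measure()
    for _ in range(iterations):
        func()
        gc.collect()
        after = measure()
        deltas.append(after - before)
        before = after
    return deltas
```

With the reproduction above, this could be called as `rss_growth_per_call(lambda: read_parquet_file(client, parquet_file))` in place of the `while True` loop. Deltas that stay large after collection would suggest the memory is being held outside the reach of Python's garbage collector, though resident set size is only a rough indicator because allocators do not always return freed memory to the operating system.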



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)