Posted to dev@arrow.apache.org by "James Porritt (JIRA)" <ji...@apache.org> on 2017/05/12 17:13:04 UTC
[jira] [Created] (ARROW-1017) Python: Calling to_pandas on a Parquet file in HDFS leaks memory
James Porritt created ARROW-1017:
------------------------------------
Summary: Python: Calling to_pandas on a Parquet file in HDFS leaks memory
Key: ARROW-1017
URL: https://issues.apache.org/jira/browse/ARROW-1017
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.3.0
Reporter: James Porritt
Running the following code results in ever-increasing memory usage, even though I would expect the DataFrame to be garbage collected when it goes out of scope at the end of each call. With my Parquet file, I see usage grow by about 3 GB per loop iteration:
{code}
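from pyarrow import HdfsClient

def read_parquet_file(client, parquet_file):
    parquet = client.read_parquet(parquet_file)
    df = parquet.to_pandas()
    # df goes out of scope on return, so its memory should be reclaimable

client = HdfsClient("hdfshost", 8020, "myuser", driver='libhdfs3')
parquet_file = '/my/parquet/file'

# Each iteration should free the previous DataFrame, yet usage keeps growing
while True:
    read_parquet_file(client, parquet_file)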
{code}
Is there a reference count issue similar to ARROW-362?
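The per-loop growth described above could be quantified with something like the following. This is only a minimal sketch, assuming a Unix-like system where the standard-library resource module is available. The helper names peak_rss and measure_growth are hypothetical and not part of pyarrow, and the bytearray workload is a stand-in for the read_parquet_file call.

```python
import gc
import resource


def peak_rss():
    """Return the peak resident set size of this process.

    Units are platform dependent (kilobytes on Linux, bytes on macOS).
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def measure_growth(func, iterations=5):
    """Call func repeatedly and record the peak RSS after each call."""
    readings = []
    for _ in range(iterations):
        func()
        gc.collect()  # rule out garbage that is merely awaiting collection
        readings.append(peak_rss())
    return readings


if __name__ == "__main__":
    # Hypothetical stand-in workload; to reproduce the report, pass
    # lambda: read_parquet_file(client, parquet_file) instead.
    readings = measure_growth(lambda: bytearray(10_000_000), iterations=3)
    print(readings)
```

Because ru_maxrss is a high-water mark, the readings never decrease. Readings that level off after the first iteration suggest memory is being reused, while readings that climb by a similar amount on every iteration would be consistent with a leak outside the reach of the Python garbage collector.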
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)