Posted to issues@arrow.apache.org by "Bryan Cutler (JIRA)" <ji...@apache.org> on 2017/05/18 21:46:04 UTC
[jira] [Created] (ARROW-1053) [Python] Memory leak with RecordBatchFileReader
Bryan Cutler created ARROW-1053:
-----------------------------------
Summary: [Python] Memory leak with RecordBatchFileReader
Key: ARROW-1053
URL: https://issues.apache.org/jira/browse/ARROW-1053
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Bryan Cutler
While working on SPARK-13534 and making repeated calls to {{toPandas}}, I saw memory usage climb continuously and isolated the problem to the Python side. The following code reproduces the issue, which looks like a memory leak. When the block with the {{RecordBatchFileReader}} is commented out and the writer is left in place, memory usage is stable, so I believe the issue is with the reader.
{noformat}
import pyarrow as pa
import numpy as np
import memory_profiler
import gc
import io


def leak():
    data = [pa.array(np.concatenate([np.random.randn(100000)] * 10))]
    table = pa.Table.from_arrays(data, ['foo'])
    while True:
        print('calling to_pandas')
        print('memory_usage: {0}'.format(memory_profiler.memory_usage()))
        df = table.to_pandas()

        batch = pa.RecordBatch.from_pandas(df)
        sink = io.BytesIO()
        writer = pa.RecordBatchFileWriter(sink, batch.schema)
        writer.write_batch(batch)
        writer.close()

        reader = pa.open_file(pa.BufferReader(sink.getvalue()))
        reader.read_all()

        gc.collect()


leak()
{noformat}
Some of the output from the code above:
{noformat}
calling to_pandas
memory_usage: [67.0546875]
calling to_pandas
memory_usage: [143.95703125]
calling to_pandas
memory_usage: [151.58984375]
calling to_pandas
memory_usage: [174.453125]
calling to_pandas
memory_usage: [189.84765625]
calling to_pandas
memory_usage: [212.7109375]
calling to_pandas
memory_usage: [228.046875]
calling to_pandas
memory_usage: [243.109375]
calling to_pandas
memory_usage: [258.4375]
calling to_pandas
memory_usage: [273.83203125]
calling to_pandas
memory_usage: [288.90234375]
calling to_pandas
memory_usage: [304.23046875]
calling to_pandas
memory_usage: [319.625]
{noformat}
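As a rough reading of the numbers above, the sketch below computes the growth between consecutive readings and compares it with the size of the test column. It assumes that {{memory_profiler}} reports values in MiB and that the first few iterations include one-time warm-up allocations; the variable names are illustrative only and are not part of the reproduction script.

```python
# Memory readings copied from the output above (assumed to be MiB).
readings = [
    67.0546875, 143.95703125, 151.58984375, 174.453125, 189.84765625,
    212.7109375, 228.046875, 243.109375, 258.4375, 273.83203125,
    288.90234375, 304.23046875, 319.625,
]

# Size of the test column: 100000 * 10 float64 values at 8 bytes each.
array_mib = 100000 * 10 * 8 / 2**20

# Growth between consecutive readings.
deltas = [b - a for a, b in zip(readings, readings[1:])]

# Skip the first five deltas (assumed warm-up) and average the rest.
steady = deltas[5:]
avg_growth = sum(steady) / len(steady)

print('array size: {0:.2f} MiB'.format(array_mib))
print('steady-state growth per iteration: {0:.2f} MiB'.format(avg_growth))
print('growth in array-sized units: {0:.2f}'.format(avg_growth / array_mib))
```

Under these assumptions, the steady-state growth works out to roughly two array-sized allocations per iteration. That is consistent with buffers not being released on each pass, but it does not by itself show which object is holding them.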