Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2017/05/18 21:58:05 UTC
[jira] [Assigned] (ARROW-1053) [Python] Memory leak with RecordBatchFileReader
[ https://issues.apache.org/jira/browse/ARROW-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney reassigned ARROW-1053:
-----------------------------------
Assignee: Wes McKinney
> [Python] Memory leak with RecordBatchFileReader
> -----------------------------------------------
>
> Key: ARROW-1053
> URL: https://issues.apache.org/jira/browse/ARROW-1053
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Bryan Cutler
> Assignee: Wes McKinney
>
> While working on SPARK-13534 and running repeated calls to {{toPandas}}, I saw memory usage climb continuously and isolated the problem to the Python side. The following code reproduces the issue, which looks like a memory leak. When I comment out the block that uses the {{RecordBatchFileReader}} but leave the writer in place, memory usage is stable, so I believe the issue is with the reader.
> {noformat}
> import pyarrow as pa
> import numpy as np
> import memory_profiler
> import gc
> import io
> def leak():
>     data = [pa.array(np.concatenate([np.random.randn(100000)] * 10))]
>     table = pa.Table.from_arrays(data, ['foo'])
>     while True:
>         print('calling to_pandas')
>         print('memory_usage: {0}'.format(memory_profiler.memory_usage()))
>         df = table.to_pandas()
>         batch = pa.RecordBatch.from_pandas(df)
>         sink = io.BytesIO()
>         writer = pa.RecordBatchFileWriter(sink, batch.schema)
>         writer.write_batch(batch)
>         writer.close()
>         reader = pa.open_file(pa.BufferReader(sink.getvalue()))
>         reader.read_all()
>         gc.collect()
> leak()
> {noformat}
> Some of the output from the code above:
> {noformat}
> calling to_pandas
> memory_usage: [67.0546875]
> calling to_pandas
> memory_usage: [143.95703125]
> calling to_pandas
> memory_usage: [151.58984375]
> calling to_pandas
> memory_usage: [174.453125]
> calling to_pandas
> memory_usage: [189.84765625]
> calling to_pandas
> memory_usage: [212.7109375]
> calling to_pandas
> memory_usage: [228.046875]
> calling to_pandas
> memory_usage: [243.109375]
> calling to_pandas
> memory_usage: [258.4375]
> calling to_pandas
> memory_usage: [273.83203125]
> calling to_pandas
> memory_usage: [288.90234375]
> calling to_pandas
> memory_usage: [304.23046875]
> calling to_pandas
> memory_usage: [319.625]
> {noformat}
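
To put a rough number on the growth in the output above, the readings can be differenced per iteration. This is only a sketch, not part of the original report: the helper name growth_per_iteration is hypothetical, the values are copied from the memory_usage lines above, and the unit is assumed to be MiB, which is what memory_profiler.memory_usage() reports as far as I know. The first reading is treated as warm-up because it is printed before the first to_pandas call.

```python
# Readings copied from the memory_usage lines in the report above.
# Assumption: the unit is MiB, as reported by memory_profiler.memory_usage().
readings = [
    67.0546875, 143.95703125, 151.58984375, 174.453125, 189.84765625,
    212.7109375, 228.046875, 243.109375, 258.4375, 273.83203125,
    288.90234375, 304.23046875, 319.625,
]


def growth_per_iteration(samples, warmup=1):
    """Return deltas between consecutive samples, skipping `warmup` samples.

    Hypothetical helper for illustration; the warm-up sample is dropped so
    that one-time allocations do not distort the steady-state growth.
    """
    steady = samples[warmup:]
    return [later - earlier for earlier, later in zip(steady, steady[1:])]


deltas = growth_per_iteration(readings)
mean_growth = sum(deltas) / len(deltas)

# Memory that rises on every iteration and never drops, even though the
# loop calls gc.collect(), matches the leak described in the report.
never_shrinks = all(delta > 0 for delta in deltas)

print('mean growth per iteration: {0:.2f} MiB'.format(mean_growth))
print('grows on every iteration: {0}'.format(never_shrinks))
```

On these readings the mean steady-state growth comes out at roughly 16 MiB per iteration, and no iteration gives memory back.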