Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2017/05/18 22:25:04 UTC
[jira] [Updated] (ARROW-1053) [Python] Memory leak with RecordBatchFileReader

     [ https://issues.apache.org/jira/browse/ARROW-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-1053:
--------------------------------
    Fix Version/s: 0.4.0

> [Python] Memory leak with RecordBatchFileReader
> -----------------------------------------------
>
>                 Key: ARROW-1053
>                 URL: https://issues.apache.org/jira/browse/ARROW-1053
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Bryan Cutler
>            Assignee: Wes McKinney
>             Fix For: 0.4.0
>
>
> While working on SPARK-13534 and running repeated calls to {{toPandas}}, I saw memory usage climb continuously, and I isolated the problem to the Python side.  The following code reproduces the issue, which looks like a memory leak.  If I comment out the block that uses the {{RecordBatchFileReader}} and leave the writer in place, memory usage is stable, so I believe the issue is with the reader.
> {noformat}
> import pyarrow as pa
> import numpy as np
> import memory_profiler
> import gc
> import io
> def leak():
>     data = [pa.array(np.concatenate([np.random.randn(100000)] * 10))]
>     table = pa.Table.from_arrays(data, ['foo'])
>     while True:
>         print('calling to_pandas')
>         print('memory_usage: {0}'.format(memory_profiler.memory_usage()))
>         df = table.to_pandas()
>         batch = pa.RecordBatch.from_pandas(df)
>         sink = io.BytesIO()
>         writer = pa.RecordBatchFileWriter(sink, batch.schema)
>         writer.write_batch(batch)
>         writer.close()
>         reader = pa.open_file(pa.BufferReader(sink.getvalue()))
>         reader.read_all()
>         gc.collect()
> leak()
> {noformat}
> Some of the output from the code above:
> {noformat}
> calling to_pandas
> memory_usage: [67.0546875]
> calling to_pandas
> memory_usage: [143.95703125]
> calling to_pandas
> memory_usage: [151.58984375]
> calling to_pandas
> memory_usage: [174.453125]
> calling to_pandas
> memory_usage: [189.84765625]
> calling to_pandas
> memory_usage: [212.7109375]
> calling to_pandas
> memory_usage: [228.046875]
> calling to_pandas
> memory_usage: [243.109375]
> calling to_pandas
> memory_usage: [258.4375]
> calling to_pandas
> memory_usage: [273.83203125]
> calling to_pandas
> memory_usage: [288.90234375]
> calling to_pandas
> memory_usage: [304.23046875]
> calling to_pandas
> memory_usage: [319.625]
> {noformat}
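
As a rough check on the output quoted above, the per-iteration growth can be compared with the size of the {{foo}} column. This is only a sketch: it assumes the readings are in MiB (the unit memory_profiler reports), and the choice to average over the last seven intervals, where the growth looks steady, is mine and not part of the original report.

```python
# Readings copied from the output quoted above (assumed to be MiB).
readings = [
    67.0546875, 143.95703125, 151.58984375, 174.453125, 189.84765625,
    212.7109375, 228.046875, 243.109375, 258.4375, 273.83203125,
    288.90234375, 304.23046875, 319.625,
]

# Size of the 'foo' column: 10 * 100000 float64 values, 8 bytes each.
column_mib = 10 * 100000 * 8 / (1024 * 1024)

# Growth between consecutive readings.
deltas = [after - before for before, after in zip(readings, readings[1:])]

# Hypothetical steady-state window: the last seven intervals.
steady = deltas[-7:]
avg_growth = sum(steady) / len(steady)
ratio = avg_growth / column_mib

print('column size: {0:.2f} MiB'.format(column_mib))
print('average growth per iteration: {0:.2f} MiB'.format(avg_growth))
print('growth / column size: {0:.2f}'.format(ratio))
```

Over that window the process grows by about 15.3 MiB per iteration, roughly twice the 7.6 MiB column. That is consistent with about two copies of the column being retained on each pass, although the readings alone do not show which allocations are being held.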


