Posted to issues@arrow.apache.org by "Bryan Cutler (JIRA)" <ji...@apache.org> on 2017/05/18 21:46:04 UTC

[jira] [Created] (ARROW-1053) [Python] Memory leak with RecordBatchFileReader

Bryan Cutler created ARROW-1053:
-----------------------------------

             Summary: [Python] Memory leak with RecordBatchFileReader
                 Key: ARROW-1053
                 URL: https://issues.apache.org/jira/browse/ARROW-1053
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Bryan Cutler


While working on SPARK-13534 and running repeated calls to {{toPandas}}, I saw memory usage climb continuously and isolated the problem to the Python side.  The following code reproduces the issue, which looks like a memory leak.  When the block that uses the {{RecordBatchFileReader}} is commented out and the writer is left in place, memory usage is stable, so I believe the issue is with the reader.

{noformat}
import pyarrow as pa
import numpy as np
import memory_profiler
import gc
import io


def leak():
    data = [pa.array(np.concatenate([np.random.randn(100000)] * 10))]
    table = pa.Table.from_arrays(data, ['foo'])
    while True:
        print('calling to_pandas')
        print('memory_usage: {0}'.format(memory_profiler.memory_usage()))
        df = table.to_pandas()

        batch = pa.RecordBatch.from_pandas(df)

        sink = io.BytesIO()
        writer = pa.RecordBatchFileWriter(sink, batch.schema)
        writer.write_batch(batch)
        writer.close()

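        # pa.open_file returns a RecordBatchFileReader; with these two
        # lines commented out, memory usage stays stable
        reader = pa.open_file(pa.BufferReader(sink.getvalue()))
        reader.read_all()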

        gc.collect()

leak()
{noformat}

Some of the output from the code above ({{memory_usage}} values are in MiB, as reported by memory_profiler):
{noformat}
calling to_pandas
memory_usage: [67.0546875]
calling to_pandas
memory_usage: [143.95703125]
calling to_pandas
memory_usage: [151.58984375]
calling to_pandas
memory_usage: [174.453125]
calling to_pandas
memory_usage: [189.84765625]
calling to_pandas
memory_usage: [212.7109375]
calling to_pandas
memory_usage: [228.046875]
calling to_pandas
memory_usage: [243.109375]
calling to_pandas
memory_usage: [258.4375]
calling to_pandas
memory_usage: [273.83203125]
calling to_pandas
memory_usage: [288.90234375]
calling to_pandas
memory_usage: [304.23046875]
calling to_pandas
memory_usage: [319.625]
{noformat}
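
To put the growth in perspective, here is a rough sketch that compares the per-iteration increase in the readings above with the size of the single float64 column the script creates. The variable names are only illustrative, and skipping the first delta as one-time start-up cost is an assumption on my part, not something measured separately.

```python
# Readings copied from the output above (memory_profiler reports MiB).
readings = [
    67.0546875, 143.95703125, 151.58984375, 174.453125, 189.84765625,
    212.7109375, 228.046875, 243.109375, 258.4375, 273.83203125,
    288.90234375, 304.23046875, 319.625,
]

# Size of the 'foo' column: 10 * 100000 float64 values, expressed in MiB.
column_mib = 10 * 100000 * 8 / 2**20

# Growth between consecutive readings.
deltas = [after - before for before, after in zip(readings, readings[1:])]

# Assumption: the first delta mostly reflects one-time allocations
# (imports, first conversion), so it is left out of the average.
steady = deltas[1:]
mean_growth = sum(steady) / len(steady)

print('column size: {0:.2f} MiB'.format(column_mib))
print('mean growth per iteration: {0:.2f} MiB'.format(mean_growth))
print('growth in column-sized units: {0:.2f}'.format(mean_growth / column_mib))
```

On these numbers the steady growth works out to roughly two column-sized buffers per iteration. That would be consistent with buffers from the read path not being released, but it does not by itself show where the memory is being held.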



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)