Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/03/04 01:59:00 UTC

[jira] [Created] (ARROW-11855) [C++] [Python] Memory leak in to_pandas when converting chunked struct array

Weston Pace created ARROW-11855:
-----------------------------------

             Summary: [C++] [Python] Memory leak in to_pandas when converting chunked struct array
                 Key: ARROW-11855
                 URL: https://issues.apache.org/jira/browse/ARROW-11855
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
            Reporter: Weston Pace
            Assignee: Weston Pace


Reproduction from [~shadowdsp]:
{code:python}
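import io

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from memory_profiler import profile

# Have jemalloc return freed memory to the OS immediately, so delayed
# release does not distort the profiler output.
pa.jemalloc_set_decay_ms(0)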

@profile
def read_file(f):
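    table = pq.read_table(f)
    # Conversion under suspicion: memory grows on every call.
    df = table.to_pandas(strings_to_categorical=True)
    # Drop both references; usage should fall back to the baseline here.
    del table
    del df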

def main():
    rows = 2000000
    df = pd.DataFrame({
        # Despite its name, this column holds dicts, which Arrow infers
        # as a struct column.
        "string": [{"test": [1, 2], "test1": [3, 4]}] * rows,
        "int": [5] * rows,
        "float": [2.0] * rows,
    })
    table = pa.Table.from_pandas(df, preserve_index=False)
    parquet_stream = io.BytesIO()
    pq.write_table(table, parquet_stream)
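    # Read the same in-memory Parquet file repeatedly and compare the
    # per-call memory reported by the profiler.
    for i in range(3):
        parquet_stream.seek(0)
        read_file(parquet_stream)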

if __name__ == '__main__':
    main()
{code}


