Posted to issues@arrow.apache.org by "Weichen Xu (Jira)" <ji...@apache.org> on 2020/08/28 03:42:00 UTC

[jira] [Created] (ARROW-9878) arrow table to_pandas self_destruct=True + split_blocks=True cannot prevent doubling memory.

Weichen Xu created ARROW-9878:
---------------------------------

             Summary: arrow table to_pandas self_destruct=True + split_blocks=True cannot prevent doubling memory.
                 Key: ARROW-9878
                 URL: https://issues.apache.org/jira/browse/ARROW-9878
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Weichen Xu


Code to reproduce:

First, generate about 800 MB of data:
{code:python}
import pyarrow as pa

# Generate about 800 MB of data: 100,000 batches of 1,000 int64 values
# (1000 values * 8 bytes * 100000 batches = 800 MB of raw buffers).
data = [pa.array([10] * 1000)]
batch = pa.record_batch(data, names=['f0'])
with open('/tmp/t1.pa', 'wb') as f1:
	writer = pa.ipc.new_stream(f1, batch.schema)
	for i in range(100000):
		writer.write_batch(batch)
	writer.close()
{code}
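As a sanity check on the "about 800 MB" figure, the raw buffer size can be sketched with back-of-the-envelope arithmetic (a sketch only: the IPC stream format adds some per-batch metadata on top of the raw value buffers, and `pa.array([10] * 1000)` is assumed to infer int64):

{code:python}
# Rough size of the raw value buffers in the generated stream.
values_per_batch = 1000
bytes_per_value = 8        # int64
num_batches = 100000

raw_bytes = values_per_batch * bytes_per_value * num_batches
print(raw_bytes)                        # 800000000 bytes
print(raw_bytes / (1024 * 1024))        # roughly 763 MiB
{code}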

Then test to_pandas with self_destruct=True, split_blocks=True, use_threads=False:

{code:python}
import os
import sys
import time

import pyarrow as pa

pid = os.getpid()
print(f'run `psrecord {pid} --plot /tmp/t001.png` and then press ENTER.')
sys.stdin.readline()

with open('/tmp/t1.pa', 'rb') as f1:
	reader = pa.ipc.open_stream(f1)
	batches = [b for b in reader]

pa_table = pa.Table.from_batches(batches)
del batches
time.sleep(3)
# Expected: self_destruct=True + split_blocks=True should avoid holding both
# the Arrow table and the pandas copy at once, but the process memory still
# doubles during this call.
pdf = pa_table.to_pandas(self_destruct=True, split_blocks=True, use_threads=False)
del pa_table
time.sleep(3)
{code}
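A way to observe the peak without an external tool like psrecord is to sample the process's peak resident set size from the standard library. This is a sketch, not part of the reproduce script above; note that ru_maxrss is reported in KiB on Linux but in bytes on macOS:

{code:python}
import resource
import sys

def peak_rss_mib():
    """Peak resident set size of this process, in MiB."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is KiB on Linux, bytes on macOS.
    if sys.platform == 'darwin':
        return rss / (1024 * 1024)
    return rss / 1024

# Print before and after to_pandas(); if self_destruct behaved as hoped,
# the peak would stay near 1x the table size rather than ~2x.
print(f'peak RSS: {peak_rss_mib():.1f} MiB')
{code}

pyarrow.total_allocated_bytes() can additionally show what Arrow's memory pool still holds, but RSS is the number that actually doubles here.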





--
This message was sent by Atlassian Jira
(v8.3.4#803005)