Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2021/02/27 13:07:00 UTC
[jira] [Commented] (ARROW-9878) [Python] table to_pandas self_destruct=True + split_blocks=True cannot prevent doubling memory
[ https://issues.apache.org/jira/browse/ARROW-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17292130#comment-17292130 ]
David Li commented on ARROW-9878:
---------------------------------
[~weichenxu123] Sorry I missed this. The type does not matter; what matters is the physical layout of the data in memory. My tests were always done with arrays of doubles.
> [Python] table to_pandas self_destruct=True + split_blocks=True cannot prevent doubling memory
> ----------------------------------------------------------------------------------------------
>
> Key: ARROW-9878
> URL: https://issues.apache.org/jira/browse/ARROW-9878
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.17.1, 1.0.0
> Reporter: Weichen Xu
> Assignee: David Li
> Priority: Major
> Attachments: t001.png
>
>
> Test on: pyarrow 1.0.1, system: Ubuntu 16.04, python3.7
>
> Code to reproduce:
> First, generate about 800 MB of data.
> {code:python}
> import pyarrow as pa
> # generate about 800MB data
> data = [pa.array([10]* 1000)]
> batch = pa.record_batch(data, names=['f0'])
> with open('/tmp/t1.pa', 'wb') as f1:
>     writer = pa.ipc.new_stream(f1, batch.schema)
>     for i in range(100000):
>         writer.write_batch(batch)
>     writer.close()
> {code}
> Then test to_pandas with self_destruct=True, split_blocks=True, use_threads=False:
> {code:python}
> import pyarrow as pa
> import time
> import sys
> import os
> pid = os.getpid()
> print(f'run `psrecord {pid} --plot /tmp/t001.png` and then press ENTER.')
> sys.stdin.readline()
> with open('/tmp/t1.pa', 'rb') as f1:
>     reader = pa.ipc.open_stream(f1)
>     batches = [b for b in reader]
> pa_table = pa.Table.from_batches(batches)
> del batches
> time.sleep(3)
> pdf = pa_table.to_pandas(self_destruct=True, split_blocks=True, use_threads=False)
> del pa_table
> time.sleep(3)
> {code}
> The attached file is the psrecord profiling result.