Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/01/11 21:51:00 UTC
[jira] [Updated] (ARROW-10417) [Python][C++] Possible Memory Leak in RecordBatchStreamWriter and RecordBatchFileWriter
[ https://issues.apache.org/jira/browse/ARROW-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weston Pace updated ARROW-10417:
--------------------------------
Attachment: arrow-10417-memtest-1.png
> [Python][C++] Possible Memory Leak in RecordBatchStreamWriter and RecordBatchFileWriter
> ---------------------------------------------------------------------------------------
>
> Key: ARROW-10417
> URL: https://issues.apache.org/jira/browse/ARROW-10417
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.15.1, 2.0.0
> Environment: This is the config for my worker node:
> resources:
> - cpus: 1
> - maxMemoryMb: 4096
> - reservedMemoryMb: 2048
> Reporter: Shouheng Yi
> Priority: Major
> Fix For: 2.0.1, 3.0.0
>
> Attachments: Screen Shot 2020-10-28 at 9.43.32 PM.png, Screen Shot 2020-10-28 at 9.43.40 PM.png, Screen Shot 2020-10-29 at 9.22.58 AM.png, arrow-10417-memtest-1.png
>
>
> There might be a memory leak in {{RecordBatchStreamWriter}}. The memory resources are not released after writing: the process always hits the memory limit and starts swapping to virtual memory. See the picture below:
> !Screen Shot 2020-10-28 at 9.43.32 PM.png!
> This was the code:
> {code:python}
> import tempfile
> import os
> import sys
> import pyarrow as pa
>
> B = 1
> KB = 1024 * B
> MB = 1024 * KB
>
> schema = pa.schema(
>     [
>         pa.field("a_string", pa.string()),
>         pa.field("an_int", pa.int32()),
>         pa.field("a_float", pa.float32()),
>         pa.field("a_list_of_floats", pa.list_(pa.float32())),
>     ]
> )
>
> nrows_in_a_batch = 1000
> nbatches_in_a_table = 1000
>
> column_arrays = [
>     ["string"] * nrows_in_a_batch,
>     [123] * nrows_in_a_batch,
>     [456.789] * nrows_in_a_batch,
>     [range(1000)] * nrows_in_a_batch,
> ]
>
>
> def main(sys_args) -> None:
>     batch = pa.RecordBatch.from_arrays(column_arrays, schema=schema)
>     table = pa.Table.from_batches([batch] * nbatches_in_a_table, schema=schema)
>     with tempfile.TemporaryDirectory() as tmpdir:
>         filename_template = "file-{n}.arrow"
>         i = 0
>         while True:
>             path = os.path.join(tmpdir, filename_template.format(n=i))
>             i += 1
>             with pa.OSFile(path, "w") as sink:
>                 with pa.RecordBatchStreamWriter(sink, schema) as writer:
>                     writer.write_table(table)
>             print(f"pa.total_allocated_bytes(): {pa.total_allocated_bytes() / MB} mb")
>
>
> if __name__ == "__main__":
>     main(sys.argv[1:])
> {code}
> Strangely enough, printing {{total_allocated_bytes}} showed a normal, constant value:
> {code}
> pa.total_allocated_bytes(): 3.95556640625 mb
> pa.total_allocated_bytes(): 3.95556640625 mb
> pa.total_allocated_bytes(): 3.95556640625 mb
> pa.total_allocated_bytes(): 3.95556640625 mb
> pa.total_allocated_bytes(): 3.95556640625 mb
> {code}
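> One way to narrow this down (a diagnostic sketch, not from the original report; it assumes a pyarrow version where {{MemoryPool.backend_name}}, {{bytes_allocated()}} and {{max_memory()}} are available) is to query the default memory pool directly: {{total_allocated_bytes()}} only reports what the pool currently holds, while {{max_memory()}} shows the high-water mark, and {{backend_name}} tells you which allocator is in use:
> {code:python}
> import pyarrow as pa
>
> # Allocate something so the pool has live buffers to report.
> arr = pa.array(range(1_000_000), type=pa.int64())
>
> pool = pa.default_memory_pool()
> print("allocator backend:", pool.backend_name)   # e.g. "jemalloc" or "system"
> print("live bytes:", pool.bytes_allocated())     # what Arrow still holds
> print("peak bytes:", pool.max_memory())          # high-water mark
> print("total_allocated_bytes:", pa.total_allocated_bytes())
> {code}
> If {{bytes_allocated()}} stays flat while the process RSS keeps growing, the memory is being retained somewhere other than Arrow's pool (for example, by the underlying allocator) rather than leaked by the writers.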
> Am I using {{RecordBatchStreamWriter}} incorrectly? If not, how can I release the resources?
> [Updates 10/29/2020]
> I tested on {{pyarrow==2.0.0}}. I still see the same issue.
> !Screen Shot 2020-10-29 at 9.22.58 AM.png!
> I changed {{RecordBatchStreamWriter}} to {{RecordBatchFileWriter}} in my code, i.e.:
> {code:python}
> ...
>             with pa.OSFile(path, "w") as sink:
>                 with pa.RecordBatchFileWriter(sink, schema) as writer:
>                     writer.write_table(table)
>             print(f"pa.total_allocated_bytes(): {pa.total_allocated_bytes() / MB} mb")
> ...
> {code}
> I observed the same memory profile. I'm wondering whether it is caused by [WriteRecordBatch|https://github.com/apache/arrow/blob/maint-0.15.x/cpp/src/arrow/ipc/writer.cc#L594] not releasing memory.
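> One possibility worth ruling out before blaming the writers (my speculation, not something the report establishes): jemalloc, the default allocator in many pyarrow builds, returns freed pages to the OS lazily, so process RSS can stay high even after Arrow has released its buffers. A sketch of two ways to test that theory — both calls are real pyarrow APIs, but whether they change the memory profile in this case is an assumption:
> {code:python}
> import pyarrow as pa
>
> # Option 1: tell jemalloc to return freed pages to the OS immediately.
> # Raises ArrowNotImplementedError if pyarrow was not built with jemalloc.
> try:
>     pa.jemalloc_set_decay_ms(0)
> except pa.ArrowNotImplementedError:
>     pass
>
> # Option 2: route Arrow allocations through the system allocator instead,
> # taking jemalloc's page caching out of the picture entirely.
> pa.set_memory_pool(pa.system_memory_pool())
> {code}
> If either change flattens the RSS curve while {{total_allocated_bytes()}} stays the same, the growth is allocator retention rather than a leak in {{RecordBatchStreamWriter}} / {{RecordBatchFileWriter}}.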
--
This message was sent by Atlassian Jira
(v8.3.4#803005)