Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/01/11 21:53:00 UTC

[jira] [Comment Edited] (ARROW-10417) [Python][C++] Possible Memory Leak in RecordBatchStreamWriter and RecordBatchFileWriter

    [ https://issues.apache.org/jira/browse/ARROW-10417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17262950#comment-17262950 ] 

Weston Pace edited comment on ARROW-10417 at 1/11/21, 9:52 PM:
---------------------------------------------------------------

[~shouheng] I'm having trouble reproducing this with the file you've provided. I am running on an Ubuntu 20.04.1 desktop and RAM usage stays constant. Can you give me some more details about the environment you are running in? From the charts it looks like you might be submitting some kind of batch job. Do you know what kind of OS it is running on and how it is configured?

If you put...
{code:python}
pa.jemalloc_set_decay_ms(10000)
{code}
...in your script (immediately after {{import pyarrow}}), does it change the behavior?
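
For reference, the placement would look like this (a minimal sketch; the 10-second decay value is just the example from above):
{code:python}
import pyarrow as pa

# Set the jemalloc decay right after importing pyarrow: freed pages are then
# held for ~10 s before being returned to the OS (0 releases them eagerly).
pa.jemalloc_set_decay_ms(10000)
{code}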

Also, do you see this kind of memory growth with other calls or is it specifically the record batch writer calls?  For example, can you try the following similar program and see if you see the same kind of growth?
{code:python}
import tempfile
import os
import sys

import pyarrow as pa

B = 1
KB = 1024 * B
MB = 1024 * KB

schema = pa.schema(
    [
        pa.field("a_string", pa.string()),
        pa.field("an_int", pa.int32()),
        pa.field("a_float", pa.float32()),
        pa.field("a_list_of_floats", pa.list_(pa.float32())),
    ]
)

nrows_in_a_batch = 1000
nbatches_in_a_table = 1000

column_arrays = [
    ["string"] * nrows_in_a_batch,
    [123] * nrows_in_a_batch,
    [456.789] * nrows_in_a_batch,
    [range(1000)] * nrows_in_a_batch,
]

def main(sys_args) -> None:
    for iteration in range(1000):
        if iteration % 100 == 0:
            print(f'Percent complete: {100*iteration/1000.0}')
        batch = pa.RecordBatch.from_arrays(column_arrays, schema=schema)
        table = pa.Table.from_batches([batch] * nbatches_in_a_table, schema=schema)

if __name__ == "__main__":
    main(sys.argv[1:])
{code}
!arrow-10417-memtest-1.png!



> [Python][C++] Possible Memory Leak in RecordBatchStreamWriter and RecordBatchFileWriter
> ---------------------------------------------------------------------------------------
>
>                 Key: ARROW-10417
>                 URL: https://issues.apache.org/jira/browse/ARROW-10417
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.15.1, 2.0.0
>         Environment: This is the config for my worker node:
> resources:
>     - cpus: 1
>     - maxMemoryMb: 4096
>     - reservedMemoryMb: 2048
>            Reporter: Shouheng Yi
>            Priority: Major
>             Fix For: 2.0.1, 3.0.0
>
>         Attachments: Screen Shot 2020-10-28 at 9.43.32 PM.png, Screen Shot 2020-10-28 at 9.43.40 PM.png, Screen Shot 2020-10-29 at 9.22.58 AM.png, arrow-10417-memtest-1.png
>
>
> There might be a memory leak in {{RecordBatchStreamWriter}}: memory never seems to be released, so the process keeps hitting the memory limit and starts swapping to virtual memory. See the picture below:
> !Screen Shot 2020-10-28 at 9.43.32 PM.png!
> This was the code:
> {code:python}
> import tempfile
> import os
> import sys
> import pyarrow as pa
> B = 1
> KB = 1024 * B
> MB = 1024 * KB
> schema = pa.schema(
>     [
>         pa.field("a_string", pa.string()),
>         pa.field("an_int", pa.int32()),
>         pa.field("a_float", pa.float32()),
>         pa.field("a_list_of_floats", pa.list_(pa.float32())),
>     ]
> )
> nrows_in_a_batch = 1000
> nbatches_in_a_table = 1000
> column_arrays = [
>     ["string"] * nrows_in_a_batch,
>     [123] * nrows_in_a_batch,
>     [456.789] * nrows_in_a_batch,
>     [range(1000)] * nrows_in_a_batch,
> ]
> def main(sys_args) -> None:
>     batch = pa.RecordBatch.from_arrays(column_arrays, schema=schema)
>     table = pa.Table.from_batches([batch] * nbatches_in_a_table, schema=schema)
>     with tempfile.TemporaryDirectory() as tmpdir:
>         filename_template = "file-{n}.arrow"
>         i = 0
>         while True:
>             path = os.path.join(tmpdir, filename_template.format(n=i))
>             i += 1
>             with pa.OSFile(path, "w") as sink:
>                 with pa.RecordBatchStreamWriter(sink, schema) as writer:
>                     writer.write_table(table)
>                     print(f"pa.total_allocated_bytes(): {pa.total_allocated_bytes() / MB} mb")
> if __name__ == "__main__":
>     main(sys.argv[1:])
> {code}
> Strangely enough, printing {{total_allocated_bytes}} shows a constant, normal-looking value:
> {code:python}
> pa.total_allocated_bytes(): 3.95556640625 mb
> pa.total_allocated_bytes(): 3.95556640625 mb
> pa.total_allocated_bytes(): 3.95556640625 mb
> pa.total_allocated_bytes(): 3.95556640625 mb
> pa.total_allocated_bytes(): 3.95556640625 mb
> {code}
> Am I using {{RecordBatchStreamWriter}} incorrectly? If not, how can I release the resources?
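>
> As a side check (a minimal sketch, assuming a pyarrow recent enough to expose {{MemoryPool.backend_name}}), the pool statistics can be printed alongside RSS to confirm that the memory is being held by the allocator rather than by Arrow's pool:
> {code:python}
> import pyarrow as pa
>
> pool = pa.default_memory_pool()
> # The pool only tracks live Arrow buffers; process RSS also includes pages
> # the allocator (jemalloc by default) has freed but not yet returned to the OS.
> print(pool.backend_name)                       # e.g. "jemalloc"
> print(pool.bytes_allocated() / (1024 * 1024))  # same figure as total_allocated_bytes()
> {code}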
> [Updates 10/29/2020]
> I tested on {{pyarrow==2.0.0}}. I still see the same issue.
> !Screen Shot 2020-10-29 at 9.22.58 AM.png! 
> I changed {{RecordBatchStreamWriter}} to {{RecordBatchFileWriter}} in my code, i.e.:
> {code:python}
> ...
>             with pa.OSFile(path, "w") as sink:
>                 with pa.RecordBatchFileWriter(sink, schema) as writer:
>                     writer.write_table(table)
>                     print(f"pa.total_allocated_bytes(): {pa.total_allocated_bytes() / MB} mb")
> ...
> {code}
> I observed the same memory profile. I'm wondering if it is caused by [WriteRecordBatch|https://github.com/apache/arrow/blob/maint-0.15.x/cpp/src/arrow/ipc/writer.cc#L594] not releasing memory.
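>
> One way to separate allocator caching from a real leak in the writer (a minimal sketch using the standard pyarrow calls {{pa.set_memory_pool}} and {{pa.system_memory_pool}}): run the same loop with the plain system allocator. If RSS then stays flat, the growth above is jemalloc holding on to freed pages rather than {{WriteRecordBatch}} leaking.
> {code:python}
> import pyarrow as pa
>
> # Route Arrow allocations through the OS allocator instead of jemalloc,
> # taking jemalloc's page caching out of the picture for this test.
> pa.set_memory_pool(pa.system_memory_pool())
> {code}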


