
Posted to jira@arrow.apache.org by "Lubo Slivka (Jira)" <ji...@apache.org> on 2022/06/05 19:06:00 UTC

[jira] [Updated] (ARROW-16697) [FlightRPC][Python] Server seems to leak memory during DoPut

     [ https://issues.apache.org/jira/browse/ARROW-16697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lubo Slivka updated ARROW-16697:
--------------------------------
    Attachment: massif.txt
                massif_client.txt

> [FlightRPC][Python] Server seems to leak memory during DoPut
> ------------------------------------------------------------
>
>                 Key: ARROW-16697
>                 URL: https://issues.apache.org/jira/browse/ARROW-16697
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Lubo Slivka
>            Assignee: David Li
>            Priority: Major
>         Attachments: leak_repro_client.py, leak_repro_server.py, massif.txt, massif_client.txt, sample.csv.gz
>
>
> Hello,
> We are stress testing our Flight RPC server (PyArrow 8.0.0) with write-heavy workloads and are running into what appear to be memory leaks.
> The server is under load from a number of separate clients issuing DoPut calls. The server's memory usage only ever goes up, until Kubernetes finally kills the server for hitting its memory limit.
> I have spent many hours searching our code for memory leaks with no success. Even short-circuiting all of our custom DoPut handling logic does not help. This led me to create a reproducer that uses nothing but PyArrow, and I see the server process memory only increasing, similar to what we see on our servers.
> The reproducer is in the attachments, along with the test CSV file (20MB) that I use for my tests. A few notes:
>  * The client code has multiple threads, each emulating a separate Flight Client
>  * There are two variants, for which I see slightly different memory usage characteristics:
>  ** _do_put_with_client_reuse: one client is opened at the start of the thread, performs many puts, and is closed at the end; leaks appear to happen faster in this variant
>  ** _do_put_with_client_per_request: the client opens and connects, does one put, then disconnects, and this is repeated many times; leaks appear to happen more slowly in this variant when there are fewer concurrent clients, and increasing the number of threads 'helps'
>  * The server code handling do_put reads batch-by-batch & does nothing with the chunks
> Also, one interesting (but very likely unrelated) thing that I keep noticing is that FlightClient _sometimes_ takes a long time to close (about 5 seconds). It happens intermittently.


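The "memory only ever goes up" observation can be made measurable by sampling the server's resident set size between rounds of puts. The sketch below is an assumption about how one might do that on Linux, not something taken from the report: rss_bytes and growth_summary are hypothetical helper names, and the /proc-based reading only works where /proc is available.

```python
import os


def rss_bytes(pid):
    """Resident set size of a process in bytes, read from /proc (Linux only)."""
    with open(f"/proc/{pid}/statm") as f:
        resident_pages = int(f.read().split()[1])  # second field: resident pages
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def growth_summary(samples):
    """Return (total growth, fraction of consecutive steps that increased)."""
    if len(samples) < 2:
        return 0, 0.0
    steps = [later - earlier for earlier, later in zip(samples, samples[1:])]
    rising = sum(1 for step in steps if step > 0)
    return samples[-1] - samples[0], rising / len(steps)
```

For example, appending rss_bytes(server_pid) to a list after each round of puts and passing the list to growth_summary gives a rough picture of the trend. A rising resident set size alone does not prove a leak, since allocators may keep freed memory for reuse, so a heap profile such as the attached massif output remains the stronger evidence.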
