Posted to jira@arrow.apache.org by "Kouhei Sutou (Jira)" <ji...@apache.org> on 2022/10/20 04:55:00 UTC

[jira] [Updated] (ARROW-14635) [C++][Dataset] Devise a mechanism to limit the total "system ram" (process + cache) used by dataset writes

     [ https://issues.apache.org/jira/browse/ARROW-14635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kouhei Sutou updated ARROW-14635:
---------------------------------
    Fix Version/s: 11.0.0
                       (was: 10.0.0)

> [C++][Dataset] Devise a mechanism to limit the total "system ram" (process + cache) used by dataset writes
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14635
>                 URL: https://issues.apache.org/jira/browse/ARROW-14635
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Ziheng Wang
>            Priority: Major
>              Labels: dataset, pull-request-available
>             Fix For: 11.0.0
>
>          Time Spent: 15h
>  Remaining Estimate: 0h
>
> The dataset writer now correctly applies backpressure.  However, that backpressure is only applied when the write calls slow down, and that only happens once the OS disk cache fills up.
> Filling up the OS disk cache is undesirable.  It will cause all running processes to get swapped (assuming the system has any swap configured) and will make the system unusable for anything else.
> It also typically has little benefit for the dataset write itself.  The marginal performance boost provided by the extra RAM is often not worth the cost.
> One way to limit the cache usage would be to use direct I/O (although that comes with a plethora of warnings).  Another way might be to flag the output as WONTNEED, but I don't know for sure if this works (the OS might still cache it so that it can satisfy the write call quickly).  Another way might be to somehow track how much disk cache is being used for writes, but that would get complex.  I'm sure there are other ways I'm just not aware of yet.
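
A rough sketch of the second idea, for illustration only.  It assumes a Linux/POSIX system and assumes that "WONTNEED" refers to the POSIX_FADV_DONTNEED advice of posix_fadvise.  WriteWithCacheDrop is a hypothetical helper, not part of Arrow and not how the dataset writer is implemented.  The advice is only a hint, so the kernel may still keep the pages, and the fdatasync after every chunk is likely to cost throughput.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <algorithm>
#include <cerrno>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not an Arrow API): writes `data` to `path` in chunks.
// After each chunk it flushes the file and advises the kernel that the pages
// just written are no longer needed.  Returns 0 on success, or an errno-style
// error code on failure.
int WriteWithCacheDrop(const std::string& path, const std::vector<char>& data,
                       std::size_t chunk_size) {
  if (chunk_size == 0) return EINVAL;
  int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return errno;

  int rc = 0;
  std::size_t offset = 0;
  while (offset < data.size()) {
    const std::size_t want = std::min(chunk_size, data.size() - offset);
    const ssize_t n = ::write(fd, data.data() + offset, want);
    if (n < 0) {
      if (errno == EINTR) continue;  // interrupted before writing; retry
      rc = errno;
      break;
    }
    // Dirty pages generally cannot be dropped, so flush them to disk first.
    if (::fdatasync(fd) != 0) {
      rc = errno;
      break;
    }
    // posix_fadvise returns the error number directly instead of setting errno.
    rc = ::posix_fadvise(fd, static_cast<off_t>(offset), static_cast<off_t>(n),
                         POSIX_FADV_DONTNEED);
    if (rc != 0) break;
    offset += static_cast<std::size_t>(n);
  }

  if (::close(fd) != 0 && rc == 0) rc = errno;
  return rc;
}
```

Whether this actually keeps the disk cache small depends on the kernel and the filesystem, which is the uncertainty raised in the description above.  Direct I/O (O_DIRECT) bypasses the cache entirely, at the price of alignment requirements on buffers, offsets, and sizes.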



--
This message was sent by Atlassian Jira
(v8.20.10#820010)