Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/11/09 21:32:00 UTC

[jira] [Commented] (ARROW-12030) [C++] Change dataset readahead to be based on available RAM/CPU instead of fixed constants/options

    [ https://issues.apache.org/jira/browse/ARROW-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441374#comment-17441374 ] 

Weston Pace commented on ARROW-12030:
-------------------------------------

This is superseded by ARROW-14648, which I just wrote now that I have more runtime experience with backpressure and readahead.

Per [~apitrou]'s earlier comment, it is better to have a limit specified by the user than to automatically base it on available RAM.

There will likely still be a need for a single "plan-level" limit at some point, but that can come later.
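
To illustrate what a user-specified limit means in practice, here is a minimal Python sketch of bounded readahead with backpressure. It is not Arrow code and does not reflect the Arrow C++ API; the names `readahead` and `max_items` are hypothetical, and a real scanner would more likely bound bytes or batches per fragment than a simple item count.

```python
import queue
import threading


def readahead(source, max_items):
    """Yield items from `source` while a background thread reads ahead.

    `max_items` is the user-specified limit: the buffer holds at most that
    many items (the producer may hold one more while it waits for space).
    """
    if max_items < 1:
        raise ValueError("max_items must be at least 1")

    buf = queue.Queue(maxsize=max_items)
    done = object()   # sentinel marking the end of the stream
    errors = []       # carries a producer-side exception to the consumer

    def producer():
        try:
            for item in source:
                # put() blocks while the buffer is full. This is the
                # backpressure: a slow consumer stalls the reader.
                buf.put(item)
        except Exception as exc:
            errors.append(exc)
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buf.get()
        if item is done:
            if errors:
                raise errors[0]
            return
        yield item
```

For example, `list(readahead(iter(range(5)), max_items=2))` returns the five items in order while never buffering more than two. The point of the design is that memory use is bounded by a number the caller chose, rather than by a guess derived from available RAM.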

> [C++] Change dataset readahead to be based on available RAM/CPU instead of fixed constants/options
> --------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12030
>                 URL: https://issues.apache.org/jira/browse/ARROW-12030
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Weston Pace
>            Priority: Major
>
> Right now, dataset scanning adds readahead in a few places.  At each spot we have to pick a maximum for how far we read ahead.  Instead of trying to work out that maximum, it might be nicer to base it on the available RAM.
> On the other hand, there may be some set of constants that simply always works, so this can probably wait until we better understand the memory usage of dataset scanning.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)