Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/12/13 17:12:00 UTC

[jira] [Updated] (ARROW-15079) [C++] Add scheduler to constrain memory of exec plans

     [ https://issues.apache.org/jira/browse/ARROW-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weston Pace updated ARROW-15079:
--------------------------------
    Description: 
This is a high-level JIRA that will probably consist of a number of subtasks.  Currently the exec plan supports backpressure.  This helps to limit the amount of data read in by a single query.  Spillover can also help reduce the amount of memory used by a single query.

However, if multiple queries are run concurrently, then the system will eventually run out of memory.  I'd like to propose a simple scheduler to start to solve this problem.

 * Initially, the scheduler will be given a configurable maximum memory target.  We will assume the user is responsible for ensuring this memory is not used elsewhere on the system.  For example, a user who configures 12 GB of RAM for the scheduler must ensure that those 12 GB are not consumed by other processes.  In future JIRA issues we can add more sophistication, such as detecting and adapting to system memory levels.

 * The scheduler will track memory usage by querying the resident set size (RSS) of the process.  An alternative approach would be to use a dedicated memory pool.  A pool would be more flexible (it allows for multiple schedulers / query engines in a single process) but wouldn't capture non-pool allocations (although these should be small).

 * The ExecPlan will be modified to submit its tasks to the scheduler instead of directly to the executor (the scheduler will then submit tasks to the executor).  Tasks will be prioritized.  When a higher-priority task needs memory, lower-priority tasks will be pushed to disk.  Prioritization will initially be based on a task's position in the ExecPlan: tasks closer to the sink will have higher priority.

 * The scheduler is separate from spillover, although it may interact with spillover mechanisms to ask for more spillover as memory pressure increases.
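
To make the proposal above more concrete, here is a minimal sketch of how such a scheduler might be shaped.  Every name in it (MemoryBoundScheduler, ScheduledTask, distance_to_sink, the probe and spill callbacks) is hypothetical and not an existing Arrow API.  The sketch assumes memory usage is reported by an injected callback (which could read the process RSS or a memory pool), that priority is the task's distance from the sink, and, as a simplification, that tasks run inline instead of being forwarded to an executor.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical task record; not part of Arrow.
struct ScheduledTask {
  int distance_to_sink;      // 0 = the sink itself; smaller values run first
  std::uint64_t sequence;    // submission order, used to break ties (FIFO)
  std::function<void()> work;
};

// Returns true when `a` should run after `b` (std::priority_queue keeps the
// element that compares "greatest" on top).
struct LowerPriority {
  bool operator()(const ScheduledTask& a, const ScheduledTask& b) const {
    if (a.distance_to_sink != b.distance_to_sink) {
      return a.distance_to_sink > b.distance_to_sink;
    }
    return a.sequence > b.sequence;
  }
};

// Hypothetical scheduler: holds tasks back while memory is over the target.
class MemoryBoundScheduler {
 public:
  using MemoryProbe = std::function<std::size_t()>;  // e.g. process RSS in bytes
  using SpillRequest = std::function<void()>;        // asks spillover to free memory

  MemoryBoundScheduler(std::size_t max_bytes, MemoryProbe probe,
                       SpillRequest request_spill)
      : max_bytes_(max_bytes),
        probe_(std::move(probe)),
        request_spill_(std::move(request_spill)) {}

  void Submit(int distance_to_sink, std::function<void()> work) {
    queue_.push(ScheduledTask{distance_to_sink, next_sequence_++, std::move(work)});
  }

  // Runs queued tasks in priority order while usage stays under the target.
  // When over the target, asks for spillover once and stops if that did not
  // bring usage back under the target.  Returns the number of tasks run.
  std::size_t RunPending() {
    std::size_t ran = 0;
    while (!queue_.empty()) {
      if (probe_() >= max_bytes_) {
        request_spill_();
        if (probe_() >= max_bytes_) break;
      }
      std::function<void()> work = queue_.top().work;  // top() is const; copy out
      queue_.pop();
      work();  // simplification: a real scheduler would hand this to the executor
      ++ran;
    }
    return ran;
  }

  std::size_t pending() const { return queue_.size(); }

 private:
  std::size_t max_bytes_;
  MemoryProbe probe_;
  SpillRequest request_spill_;
  std::uint64_t next_sequence_ = 0;
  std::priority_queue<ScheduledTask, std::vector<ScheduledTask>, LowerPriority> queue_;
};
```

A caller would construct the scheduler with a target (for example 12 GB), a probe, and a spill callback, then call Submit from each plan node and RunPending whenever tasks are queued or memory is released.  Injecting the probe keeps the RSS-versus-memory-pool question open, since either source can sit behind the same callback.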


> [C++] Add scheduler to constrain memory of exec plans
> -----------------------------------------------------
>
>                 Key: ARROW-15079
>                 URL: https://issues.apache.org/jira/browse/ARROW-15079
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>


