You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "Josh Bacon (JIRA)" <ji...@apache.org> on 2018/01/09 19:04:00 UTC

[jira] [Comment Edited] (AIRFLOW-57) Rename concurrency configuration variables to be more clear

    [ https://issues.apache.org/jira/browse/AIRFLOW-57?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318964#comment-16318964 ] 

Josh Bacon edited comment on AIRFLOW-57 at 1/9/18 7:03 PM:
-----------------------------------------------------------

FWIW and FYI, regarding +DAG_CONCURRENCY+ and the statement_ "per_worker would suggest that it's a global setting for workers, but I think you can have workers with different values set for this"_. My tests using celery workers have shown that individual workers +*cannot*+ override/change this concurrency "per-worker", the scheduler's configuration is always applied instead. In my tests AIRLFOW__CELERY__CELERYD_CONCURRENCY was increased on my workers to run more concurrent tasks, but no tasks were sent from the scheduler to fill this concurrency (which was limited by a lower AIRFLOW__CORE__DAG_CONCURRENCY).


was (Author: jbacon):
FWIW and FYI, regarding +AIRFLOW__CORE__DAG_CONCURRENCY+ and the statement_ "per_worker would suggest that it's a global setting for workers, but I think you can have workers with different values set for this"_. My tests using celery workers have shown that individual workers +*cannot*+ override/change this concurrency "per-worker", the scheduler's configuration is always applied instead. In my tests AIRLFOW__CELERY__CELERYD_CONCURRENCY was increased on my workers to run more concurrent tasks, but no tasks were sent from the scheduler to fill this concurrency (which was limited by a lower AIRFLOW__CORE__DAG_CONCURRENCY).

> Rename concurrency configuration variables to be more clear
> -----------------------------------------------------------
>
>                 Key: AIRFLOW-57
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-57
>             Project: Apache Airflow
>          Issue Type: Improvement
>    Affects Versions: Airflow 1.7.0
>            Reporter: Bence Nagy
>            Priority: Minor
>              Labels: easyfix, newbie
>
> Currently the following config variables exists for controlling parallel execution limits:
> {code}
> # The amount of parallelism as a setting to the executor. This defines
> # the max number of task instances that should run simultaneously
> # on this airflow installation
> parallelism = 32
> # The number of task instances allowed to run concurrently by the scheduler
> dag_concurrency = 16
> # When not using pools, tasks are run in the "default pool",
> # whose size is guided by this config element
> non_pooled_task_slot_count = 128
> # The maximum number of active DAG runs per DAG
> max_active_runs_per_dag = 16
> {code}
> Let's go through these one by one:
> {{parallelism}}: not a very descriptive name, considering that all the above settings are for parallelism. The description says it sets the maximum task instances for the airflow installation, which is a bit ambiguous — if I have two hosts running airflow workers, I'd have airflow installed on two hosts, so that should be two installations, but based on context 'per installation' here means 'per Airflow state database'. I'd name this {{max_active_tasks}}.
> {{dag_concurrency}}: Despite the name based on the comment this is actually the task concurrency, and it's per worker. I'd name this {{max_active_tasks_for_worker}} ({{per_worker}} would suggest that it's a global setting for workers, but I think you can have workers with different values set for this).
> {{non_pooled_task_slot_count}}: This is a weird one. I'm going to pass on suggesting a name for it because I just can't think of any reason this config variable should exist. We already have a global task instance limit, and we have pools to limit access to certain resources — in what case would someone want to limit access to everything other than certain resources? So, yeah, skipping this one. In case this was needed only due to how pools are implemented, I'd suggest setting the limit to {{sys.maxsize}} and just removing the config variable.
> {{max_active_runs_per_dag}}: This one's kinda alright, but since it seems to be just a default value for the matching {{DAG}} kwarg, it might be nice to reflect that in the name, something like {{default_max_active_runs_for_dags}}
> So let's move on to the {{DAG}} kwargs:
> {{concurrency}}: Again, having a general name like this, coupled with the fact that concurrency is used for something different elsewhere makes this pretty confusing. I'd call this {{max_active_tasks}}.
> {{max_active_runs}}: This one sounds alright to me.
> So. If people agree that this is something that should be fixed, I think it'd be nice to get this in the 1.7.1 release, especially considering that it should be really easy to make the change backwards compatible.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)