Posted to dev@airflow.apache.org by Jerry Chi <je...@smartnews.com> on 2018/10/23 18:37:35 UTC

Guidelines around how to scale with worker nodes?

Hi everyone,

I'm occasionally observing tasks stuck in "queued" for a long time despite
trying various edits of parameter values in airflow.cfg and I'm guessing it
would help to increase the number of worker nodes (right now I have one
worker node).

Are there any guidelines for:
1. How to determine if the # of worker nodes is indeed the bottleneck
causing tasks to be stuck in "queued" ? It doesn't seem the memory/CPU
usage on the worker node is close to 100%.
2. How to determine the optimal number and CPU/memory specs of the worker
nodes if I want to be able to handle X simultaneous tasks without them
getting stuck in "queued" ?
I'm using CeleryExecutor + RabbitMQ on EC2.

Thanks~
Jerry Chi ジェリー・チー | Data Science Manager | +81-70-2668-5491 | LINE/Skype:
peacej | 카톡: peacej2 | WeChat: jerrychijerry

Re: Guidelines around how to scale with worker nodes?

Posted by Kevin Yang <yr...@gmail.com>.

Hi Jerry,
This may require you to profile the tasks running on your machines
yourself, so you can get an idea of how much computational resource your
tasks are consuming. Generally, the # of worker nodes is the root cause
for tasks getting stuck in the QUEUED state. You can verify that by
comparing the # of running tasks with the number of worker nodes you have
to see if all workers are busy. Additionally, you can check out
CgroupTaskRunner, which would allow you to run tasks in cgroups and thus
make it possible to run multiple tasks on one machine.
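
As a rough illustration of the comparison described above, here is a
minimal sketch in plain Python. It does not call Airflow or Celery; the
counts are hypothetical numbers you would read off the UI or the metadata
database yourself, and slots_per_worker is an assumed stand-in for the
Celery worker concurrency setting in airflow.cfg.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterSnapshot:
    """Hypothetical point-in-time counts, collected by hand."""
    running_tasks: int     # tasks currently in the RUNNING state
    queued_tasks: int      # tasks currently in the QUEUED state
    worker_nodes: int      # number of Celery worker machines
    slots_per_worker: int  # assumed: concurrent tasks one worker accepts


def total_slots(snapshot: ClusterSnapshot) -> int:
    """Upper bound on tasks the workers can run at the same time."""
    return snapshot.worker_nodes * snapshot.slots_per_worker


def workers_saturated(snapshot: ClusterSnapshot) -> bool:
    """True when tasks are waiting and every worker slot is occupied."""
    return (
        snapshot.queued_tasks > 0
        and snapshot.running_tasks >= total_slots(snapshot)
    )


def workers_needed(target_concurrent_tasks: int, slots_per_worker: int) -> int:
    """Smallest worker count whose slots cover the target concurrency."""
    if slots_per_worker <= 0:
        raise ValueError("slots_per_worker must be positive")
    return math.ceil(target_concurrent_tasks / slots_per_worker)


if __name__ == "__main__":
    # Made-up example: one worker with 16 slots, all busy, 5 tasks waiting.
    snapshot = ClusterSnapshot(
        running_tasks=16, queued_tasks=5, worker_nodes=1, slots_per_worker=16
    )
    print(workers_saturated(snapshot))  # True: all 16 slots are in use
    print(workers_needed(40, 16))       # 3 workers to cover 40 tasks
```

If the check reports that the workers are not saturated while tasks are
still queued, the bottleneck is probably somewhere other than the worker
count. Slot counts also say nothing about CPU or memory, so the per-task
profiling mentioned above is still needed to pick machine specs.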

Cheers,
Kevin Y

Re: Guidelines around how to scale with worker nodes?

Posted by Jerry Chi <je...@smartnews.com>.

Sorry, any tips or hints related to the questions below? Thank you.

Jerry

On Wed, Oct 24, 2018 at 3:37, Jerry Chi (jerry.chi@smartnews.com) wrote:

> Hi everyone,
>
> I'm occasionally observing tasks stuck in "queued" for a long time despite
> trying various edits of parameter values in airflow.cfg and I'm guessing it
> would help to increase the number of worker nodes (right now I have one
> worker node).
>
> Are there any guidelines for:
> 1. How to determine if the # of worker nodes is indeed the bottleneck
> causing tasks to be stuck in "queued" ? It doesn't seem the memory/CPU
> usage on the worker node is close to 100%.
> 2. How to determine the optimal number and CPU/memory specs of the worker
> nodes if I want to be able to handle X simultaneous tasks without them
> getting stuck in "queued" ?
> I'm using CeleryExecutor + RabbitMQ on EC2.
>
> Thanks~
> Jerry Chi ジェリー・チー | Data Science Manager | +81-70-2668-5491 | LINE/Skype:
> peacej | 카톡: peacej2 | WeChat: jerrychijerry
>