Posted to dev@airflow.apache.org by David Klosowski <da...@thinknear.com> on 2017/08/07 18:21:25 UTC

Stuck Tasks that don't report status

Hi Airflow Dev List:

Has anyone had cases where tasks get "stuck"?  What I mean by "stuck" is
that tasks show as running through the Airflow UI but never actually run
(and dependent tasks will eventually timeout).

This only happens during our deployments, when we replace all the hosts in
our stack (3 workers and 1 host running the scheduler + webserver + flower)
with a dockerized deployment.  We've been deploying to the worker hosts
after the scheduler + webserver + flower host.

It also doesn't occur all the time, which is a bit frustrating to try to
debug.

We have the following settings:

> celery_result_backend = Postgres
> sql_alchemy_conn = Postgres
> broker_url = Redis
> executor = CeleryExecutor

Any thoughts from anyone regarding known issues or observed problems?  I
haven't seen a JIRA issue for this after looking through the Airflow JIRA.

Thanks.

Regards,
David

Re: Stuck Tasks that don't report status

Posted by Alex Guziel <al...@airbnb.com.INVALID>.
I know that with a scheduler restart, tasks may still report as running
even though they are not.


Re: Stuck Tasks that don't report status

Posted by David Klosowski <da...@thinknear.com>.
Hi Gerard,

The interesting thing is that we didn't see this issue in 1.7.1.3 but we
did when upgrading to 1.8.0.

We aren't seeing any timeout on the task in question to be quite honest.
The state of the task never changes and we have reasonable timeouts on our
tasks that would notify us.  The task is in fact "stuck" without reporting any
status.  There are other cases where tasks do in fact fail and then go into
retry state, which we see normally (this happens quite a bit for us on
deploys).  There is clearly some edge case here where the failure -> retry
does not happen and the dagrun never updates.

What we do see is timeouts on the Sensors that depend on those tasks, and we've
added SLAs to some of our important tasks to surface issues earlier.
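
For context, the knobs I mean look roughly like the sketch below. The DAG,
task names and intervals are made up, and the imports are the 1.8-style paths;
the point is only where sla, execution_timeout and the sensor timeout get set:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.sensors import ExternalTaskSensor

# Hypothetical DAG purely for illustration.
dag = DAG(
    dag_id="example_pipeline",
    start_date=datetime(2017, 8, 1),
    schedule_interval="@daily",
)

# execution_timeout fails the task if it runs too long; sla flags an SLA miss
# if the task hasn't finished within that delta after the schedule period.
important_task = BashOperator(
    task_id="important_task",
    bash_command="echo 'do the work'",
    execution_timeout=timedelta(hours=1),
    sla=timedelta(hours=2),
    dag=dag,
)

# In our setup the sensor lives in a downstream DAG; it's shown in the same
# file here only to keep the sketch self-contained. Its timeout is what fires
# when the upstream task is stuck in "running" and never finishes.
wait_for_important_task = ExternalTaskSensor(
    task_id="wait_for_important_task",
    external_dag_id="example_pipeline",
    external_task_id="important_task",
    timeout=60 * 60,     # seconds before the sensor itself fails
    poke_interval=60,
    dag=dag,
)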

Does anyone know where this code lives?  Is that a function of the
dagrun_timeout?
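
For reference, the only place I know to set dagrun_timeout is on the DAG
itself, along the lines of this made-up sketch:

from datetime import datetime, timedelta
from airflow import DAG

# Hypothetical DAG; dagrun_timeout bounds how long a whole DagRun may stay
# in the running state.
dag = DAG(
    dag_id="example_pipeline",
    start_date=datetime(2017, 8, 1),
    schedule_interval="@daily",
    dagrun_timeout=timedelta(hours=6),
)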

Thanks.

Regards,
David

Re: Stuck Tasks that don't report status

Posted by Gerard Toonstra <gt...@gmail.com>.
Hi David,

When tasks are put on the MQ, they are out of the scheduler's control.
The scheduler sets the state of that task instance to "queued".

What happens next:

1. A worker picks up the task and tries to run it.
2. Before executing, it first runs a couple of final checks against the DB to
   see if the task should still run now that the worker is about to pick it up
   (another worker could already have processed it, started processing it, etc.).
3. The worker sets the state of the TI to "running".
4. The worker does the work as described in the operator.
5. The worker then updates the database with failure or success (see the
   sketch below).
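
Very roughly, the worker side looks something like the sketch below. This is
just an illustration of the state transitions, not the actual Airflow code
(that lives in models.py / TaskInstance.run()):

QUEUED, RUNNING, SUCCESS, FAILED = "queued", "running", "success", "failed"

# Stand-in for the task_instance row in the metadata DB.
task_instance = {"task_id": "important_task", "state": QUEUED}


def final_checks_pass(ti):
    # Step 2: final checks against the DB. Is the TI still in a runnable
    # state, or did another worker already pick it up / finish it?
    return ti["state"] == QUEUED


def run_operator(ti):
    # Step 4: stand-in for the operator's execute(); raise to simulate failure.
    print("doing the work for %s" % ti["task_id"])


def worker_run(ti):
    if not final_checks_pass(ti):
        return
    ti["state"] = RUNNING            # step 3: the only thing the scheduler sees
    try:
        run_operator(ti)
        ti["state"] = SUCCESS        # step 5
    except Exception:
        ti["state"] = FAILED         # step 5
    # If the container is killed between step 3 and step 5, the row stays in
    # "running" and nothing ever comes back to update it.


worker_run(task_instance)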

If you kill the Docker container doing the execution before it has updated the
state to success or failed, you end up in a situation where a timeout must
occur before Airflow can see whether the task failed or not. This is because
the worker claims to be processing the message, but that worker/task got
killed.

It is actually the task instance that updates the database, so if you leave
that container running, it will possibly finish and update the DB.


The task results are also communicated back to the executors and there's a
check to see if the results agree.

You can find this code in models.py / TaskInstance / run() and in whichever
Executor you are using under airflow/executors.
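
Conceptually the "do the results agree" check boils down to something like
this (again just a sketch, not the real executor code):

def reconcile(executor_state, db_state):
    # The executor hears a final state back from Celery and compares it with
    # the state the task instance itself wrote to the metadata DB.
    if executor_state == db_state:
        return "ok"
    # e.g. Celery reports the task as failed (or the worker vanished) while
    # the DB still says "running", which is the stuck case being discussed.
    return "mismatch: executor=%s db=%s" % (executor_state, db_state)


print(reconcile("failed", "running"))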


The reason this happens, I think, is that Docker doesn't really care what is
running at the moment; it assumes "services", where interruptions are
acceptable because services are retried all the time anyway. In an environment
like Airflow, there's a persistent backend database that doesn't automatically
retry, because everything is driven through the scheduler, which only sees a
"RUNNING" record in the database.

How to deal with this depends on your situation. If you run only short-running
tasks (up to 5 mins), you could drain the task queue by stopping the scheduler
first. This means no new messages are sent to the queue, so after 10 mins you
should have no tasks running on any workers.

Another way is to update the database in between, but I'd personally avoid
that as much as you can.
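
If you do go down that road, or just want to check what a deploy left behind,
the idea is something along these lines against the metadata DB. The
connection string is made up, and you should double-check the query against
your own schema before actually running the UPDATE:

from sqlalchemy import create_engine, text

# Made-up connection string; point it at your sql_alchemy_conn database.
engine = create_engine("postgresql://airflow:airflow@localhost/airflow")

with engine.begin() as conn:
    stuck = conn.execute(text(
        "SELECT dag_id, task_id, execution_date "
        "FROM task_instance WHERE state = 'running'"
    )).fetchall()
    for dag_id, task_id, execution_date in stuck:
        print("still running: %s.%s @ %s" % (dag_id, task_id, execution_date))

    # Only if you are certain no worker is still executing these:
    # conn.execute(text(
    #     "UPDATE task_instance SET state = 'up_for_retry' "
    #     "WHERE state = 'running'"
    # ))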


Not sure if anyone wants to chime in here on how best to deal with this in
Docker?

Rgds,

Gerard

