Posted to dev@airflow.apache.org by Kevin Lam <ke...@fathomhealth.co> on 2018/11/20 15:31:54 UTC

'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.

Hi,

We run Apache Airflow in Kubernetes in a manner very similar to what is
outlined in puckel/docker-airflow [1] (Celery Executor, Redis for
messaging, Postgres).

Lately, we've encountered some of our Tasks getting stuck in a running
state and printing out the following errors:

[2018-11-20 05:31:23,009] {models.py:1329} INFO - Dependencies not met for <TaskInstance: BLAH 2018-11-19T19:19:50.757184+00:00 [running]>, dependency 'Task Instance Not Already Running' FAILED: Task is already running, it started on 2018-11-19 23:29:11.974497+00:00.
[2018-11-20 05:31:23,016] {models.py:1329} INFO - Dependencies not met for <TaskInstance: BLAH 2018-11-19T19:19:50.757184+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
Is there any way to avoid this? Does anyone know what causes this issue?

This is quite problematic: when the above error occurs, the task is stuck
in the running state without making any progress, so turning on retries
doesn't help our DAGs reliably run to completion.
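
To make the "must be cleared" part concrete, the sketch below is roughly
what clearing boils down to against the metadata database (purely
illustrative; the connection string, dag_id, task_id, and execution_date
are placeholders, and the UI "Clear" action or the airflow clear CLI is
the normal way to do it):

import sqlalchemy as sa

# Placeholder connection string for the Airflow metadata database (Postgres here).
engine = sa.create_engine("postgresql://airflow:airflow@postgres/airflow")

# Clearing a task instance essentially nulls out its state so the scheduler
# treats it as schedulable again; dag_id/task_id/execution_date are placeholders.
with engine.begin() as conn:
    conn.execute(
        sa.text(
            "UPDATE task_instance SET state = NULL "
            "WHERE dag_id = :dag_id AND task_id = :task_id "
            "AND execution_date = :execution_date"
        ),
        {
            "dag_id": "BLAH",
            "task_id": "some_task",
            "execution_date": "2018-11-19 19:19:50.757184+00:00",
        },
    )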

Thanks!

[1] https://github.com/puckel/docker-airflow

Re: 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.

Posted by Gabriel Silk <gs...@dropbox.com.INVALID>.
Two questions:
1) Are you eventually seeing the full log for the task, after it finishes?
2) Are you using S3 to store your logs?

On Thu, Feb 14, 2019 at 11:53 AM Dan Stoner <da...@gmail.com> wrote:


Re: 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.

Posted by Dan Stoner <da...@gmail.com>.
More info!

It appears that the Celery executor will silently fail if the
credentials for a Postgres result_backend are not valid.

For example, we see:

[2019-02-13 20:45:21,132] {{models.py:1353}} INFO - Dependencies not met for <TaskInstance: update_table_progress.update_table 2019-02-13T20:30:00+00:00 [running]>, dependency 'Task Instance Not Already Running' FAILED: Task is already running, it started on 2019-02-13 20:45:09.088978+00:00.
[2019-02-13 20:45:21,132] {{models.py:1353}} INFO - Dependencies not met for <TaskInstance: update_table_progress.update_table 2019-02-13T20:30:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
[2019-02-13 20:45:21,135] {{logging_mixin.py:95}} INFO - [2019-02-13 20:45:21,134] {{jobs.py:2514}} INFO - Task is not able to be run


yet there is no database connection failure anywhere in the logs.

After fixing our connection string (via
AIRFLOW__CELERY__RESULT_BACKEND or result_backend in airflow.cfg),
these issues went away.
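
In case it helps anyone debugging the same thing, a quick way to check
the credentials independently of Airflow is to hand the same URL to
SQLAlchemy directly (a sketch; it assumes the value is exposed as
AIRFLOW__CELERY__RESULT_BACKEND and that Celery's "db+" prefix has to be
stripped before SQLAlchemy will accept the URL):

import os
import sqlalchemy as sa

# The same URL Airflow hands to Celery; the fallback below is a placeholder.
url = os.environ.get(
    "AIRFLOW__CELERY__RESULT_BACKEND",
    "db+postgresql://airflow:airflow@postgres/airflow",
)

# Celery's database result backend URLs carry a "db+" prefix that SQLAlchemy
# itself does not understand, so strip it before connecting.
if url.startswith("db+"):
    url = url[len("db+"):]

try:
    with sa.create_engine(url).connect() as conn:
        conn.execute(sa.text("SELECT 1"))
    print("result_backend credentials look OK")
except Exception as exc:
    print("result_backend connection failed:", exc)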


Sorry I cannot produce a more solid bug report, but hopefully this is a
breadcrumb for someone.

Dan Stoner

On Wed, Feb 13, 2019 at 10:16 PM Dan Stoner <da...@gmail.com> wrote:

Re: 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.

Posted by Dan Stoner <da...@gmail.com>.
We saw this but the task instance state was generally "SUCCESS".

In our case, we thought it was due to Redis being used as the results
store. There is a WARNING against this right in the operational logs.
Google Cloud Composer is surprisingly set up in this fashion.

We went back to running our own infrastructure with Postgres as the
results store, and those issues have not occurred since.
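
For reference, the relevant setting looks roughly like this in
airflow.cfg (host and credentials below are placeholders; the same value
can also be set through the AIRFLOW__CELERY__RESULT_BACKEND environment
variable):

[celery]
result_backend = db+postgresql://airflow:airflow@postgres-host:5432/airflow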

The real downside we saw from this error was that our workers were
highly underutilized, our overall data throughput was terrible, and the
workers kept trying to run tasks they couldn't actually run.

- Dan Stoner


On Wed, Feb 13, 2019 at 4:16 PM Kevin Lam <ke...@fathomhealth.co> wrote:

Re: 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.

Posted by Daniel Huang <dx...@gmail.com>.
We've been plagued by this as well, and it prevents us from setting
stricter retry limits. Similar setup, but using MySQL. We're also seeing
it more for long-running tasks (sensors).

-Daniel

On Wed, Feb 13, 2019 at 1:48 PM James Meickle <jm...@quantopian.com.invalid> wrote:


Re: 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.

Posted by James Meickle <jm...@quantopian.com.INVALID>.
In some cases this is a double execution in Celery. Two workers grab the
same task, but the first one to update the metadata DB to "running" is
the only one allowed to run. In our case this leads to confusing, but
ultimately not incorrect, behavior: the failed attempt writes a log file
and makes it available, while the other attempt is still running on
another instance and eventually succeeds.
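
One thing worth checking if the duplicate delivery is coming from the
broker: with Redis (which the original setup uses for messaging), Celery
re-delivers a message to another worker once its visibility timeout
elapses, so tasks that run longer than that timeout can get picked up
twice, which would line up with this showing up mostly on long-running
tasks. The knob lives in airflow.cfg, roughly like this (the value is
just an example, in seconds):

[celery_broker_transport_options]
visibility_timeout = 21600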

On Wed, Feb 13, 2019 at 4:16 PM Kevin Lam <ke...@fathomhealth.co> wrote:


Re: 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.

Posted by Kevin Lam <ke...@fathomhealth.co>.
Friendly ping on the above! Has anyone encountered this by chance?

We're still seeing it occasionally on longer running tasks.

On Tue, Nov 20, 2018 at 10:31 AM Kevin Lam <ke...@fathomhealth.co> wrote:
