Posted to dev@airflow.apache.org by Ken Edwards <ke...@lyft.com> on 2017/01/12 00:09:56 UTC

Complex Dag Runs Getting Stuck: Tasks Not Transitioning on Success

I've successfully been using Airflow the past few months to run relatively
simple dags on a regular cadence. However, I've run into stability issues
with more complex workflows that consist of parent/child dags, some of
which may contain hundreds of tasks. The most common symptom is that state
transitions do not always happen even though the previous task succeeds,
requiring human monitoring/prodding when it gets stuck.

I would appreciate any advice on the things to look at to debug this or
common causes for this.

Thank you,
Ken

Re: Complex Dag Runs Getting Stuck: Tasks Not Transitioning on Success

Posted by Arthur Purvis <ap...@lumoslabs.com>.
Sub-DAGs don't really work well, if that's what you mean by
"parent/child DAGs".

On Thu, Jan 12, 2017 at 12:31 PM, Maxime Beauchemin <maximebeauchemin@gmail.com> wrote:

> Is there any error in the logs of the task instance (on the worker) towards
> the end? Perhaps the airflow process is unable to communicate with the
> database in some cases? Maybe network or database being unresponsive? If
> that is the case there should be a stack trace in the log.
>
> In any case, the task should stop emitting heartbeats, and the scheduler
> should eventually mark it as failed. If you have retries set up, the
> scheduler should then proceed to starting a retry.
>
> Max
>
> On Wed, Jan 11, 2017 at 4:09 PM, Ken Edwards <ke...@lyft.com> wrote:
>
> > I've successfully been using Airflow the past few months to run relatively
> > simple dags on a regular cadence. However, I've run into stability issues
> > with more complex workflows that consist of parent/child dags, some of
> > which may contain hundreds of tasks. The most common symptom is that state
> > transitions do not always happen even though the previous task succeeds,
> > requiring human monitoring/prodding when it gets stuck.
> >
> > I would appreciate any advice on the things to look at to debug this or
> > common causes for this.
> >
> > Thank you,
> > Ken
> >
>

Re: Complex Dag Runs Getting Stuck: Tasks Not Transitioning on Success

Posted by Maxime Beauchemin <ma...@gmail.com>.
Is there any error in the logs of the task instance (on the worker) towards
the end? Perhaps the airflow process is unable to communicate with the
database in some cases? Maybe network or database being unresponsive? If
that is the case there should be a stack trace in the log.
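
As a rough illustration of the log check suggested above, here is a minimal sketch that looks for a Python stack trace near the end of a task log file. The function name, the 200-line window, and the idea that the log is a plain text file on the worker are assumptions made for this example, not anything Airflow provides.

```python
from pathlib import Path

# Standard first line of a Python stack trace.
TRACEBACK_MARKER = "Traceback (most recent call last):"


def last_traceback(log_path, tail_lines=200):
    """Return the text from the last traceback in the log tail, or None.

    log_path and tail_lines are hypothetical parameters for this sketch.
    """
    lines = Path(log_path).read_text(errors="replace").splitlines()
    tail = lines[-tail_lines:]
    # Walk backwards so the most recent traceback is found first.
    for i in range(len(tail) - 1, -1, -1):
        if TRACEBACK_MARKER in tail[i]:
            return "\n".join(tail[i:])
    return None
```

Running this over the logs of the task instances that got stuck, and seeing whether the returned text mentions a database or network error, would be one way to test the theory that the process lost contact with the database.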

In any case, the task should stop emitting heartbeats, and the scheduler
should eventually mark it as failed. If you have retries set up, the
scheduler should then proceed to start a retry.
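
For reference, retries are commonly configured through the `default_args` dictionary passed to a DAG, using the `retries` and `retry_delay` operator arguments. The sketch below only builds that dictionary; the specific values (3 retries, 5 minutes apart) are assumptions chosen for illustration, not recommendations from this thread.

```python
from datetime import timedelta

# Operator defaults; values here are illustrative assumptions.
default_args = {
    "retries": 3,                         # re-run a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),  # wait between attempts
}
```

The dictionary would then be passed as `default_args=default_args` when constructing the DAG, so that every task in it inherits the retry settings unless it overrides them.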

Max

On Wed, Jan 11, 2017 at 4:09 PM, Ken Edwards <ke...@lyft.com> wrote:

> I've successfully been using Airflow the past few months to run relatively
> simple dags on a regular cadence. However, I've run into stability issues
> with more complex workflows that consist of parent/child dags, some of
> which may contain hundreds of tasks. The most common symptom is that state
> transitions do not always happen even though the previous task succeeds,
> requiring human monitoring/prodding when it gets stuck.
>
> I would appreciate any advice on the things to look at to debug this or
> common causes for this.
>
> Thank you,
> Ken
>