Posted to dev@airflow.apache.org by "Kamenik, John" <jk...@fourv.com> on 2018/03/02 15:09:31 UTC

Airflow loses track of DAG Tasks

I have an Airflow 1.9 cluster set up on Kubernetes, and I have an issue where a random DAG Task shows as failed because it appears that Airflow has lost track of it.  The cluster consists of a database, a Redis store, a scheduler, and 14 workers.


What happens is the task starts as normal, runs, and exits, but instead of the status being written, the Operator, Start Date, Job ID, and Hostname are erased.  Shortly thereafter an end time is added and the state is set to failed.


Given the hostname is erased, I have to brute-force search for the logs of the worker that executed the task.  When I can find the task logs, they indicate the command (BashOperator) ran to completion and exited cleanly.  I don't see any errors in the Airflow scheduler or any workers that would indicate any issues.  I am not sure what else to debug.




- John K
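The symptom described above (state flipped to failed while Operator/Hostname are erased) can be spotted directly in the metadata database. Below is a minimal sketch, with assumptions: it mimics a few columns of Airflow's task_instance table in an in-memory SQLite database purely for illustration, and the DAG/task names are made up; a real deployment would run the equivalent query against its Postgres metadata DB.

```python
import sqlite3

# Stand-in for the metadata DB (assumption: column subset of Airflow's
# task_instance table; real deployments would connect to Postgres instead).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE task_instance (
        dag_id TEXT, task_id TEXT, execution_date TEXT,
        state TEXT, hostname TEXT, operator TEXT, job_id INTEGER
    )
""")
rows = [
    ("etl", "extract", "2018-03-01", "success", "worker-3", "BashOperator", 101),
    # The pathological case: failed, but the bookkeeping fields were erased.
    ("etl", "load",    "2018-03-01", "failed",  None,       None,           None),
]
conn.executemany("INSERT INTO task_instance VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# A failed task that also lost its hostname and operator matches the
# "Airflow lost track of it" symptom, rather than a genuine task failure.
suspicious = conn.execute("""
    SELECT dag_id, task_id, execution_date
    FROM task_instance
    WHERE state = 'failed' AND hostname IS NULL AND operator IS NULL
""").fetchall()
print(suspicious)  # [('etl', 'load', '2018-03-01')]
```

Such a query could feed the kind of manual failure monitoring mentioned later in the thread.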

Re: Airflow loses track of DAG Tasks

Posted by "Kamenik, John" <jk...@fourv.com>.
Fokko,


Thanks.  The issue you describe appears to be exactly what we are seeing.  We will stick with Celery 4.0 and keep monitoring manually for failures.




- John K
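The "stick with Celery 4.0" decision above amounts to a version pin. As a hedged illustration (the helper below is hypothetical, not part of Airflow or Celery, and its parsing is a simple sketch rather than a full PEP 440 implementation), a deploy script could refuse any Celery outside the 4.0.x range:

```python
# Hypothetical pre-deploy check mirroring the pin discussed above:
# accept 4.0.x, reject 4.1+ (and anything older than 4.0).
def within_pin(version, lower=(4, 0), upper=(4, 1)):
    """True if lower <= version < upper, comparing major.minor only."""
    parts = tuple(int(p) for p in version.split(".")[:2])
    return lower <= parts < upper

assert within_pin("4.0.2")       # the version the thread settles on
assert not within_pin("4.1.0")   # the version that caused trouble
```

In a requirements file the same pin would read `celery>=4.0,<4.1`.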


________________________________
From: fokko@driesprongen.nl <fo...@driesprongen.nl> on behalf of Driesprong, Fokko <fo...@driesprong.frl>
Sent: Wednesday, March 7, 2018 9:39 AM
To: dev@airflow.incubator.apache.org
Subject: Re: Airflow loses track of DAG Tasks


Re: Airflow loses track of DAG Tasks

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Hi John,

There are some issues with Celery 4.1
(https://github.com/apache/incubator-airflow/pull/2806), therefore we are
still at 4.0. The release cycle of Celery seems to be a bit stuck:
https://github.com/celery/celery/issues/4387

The only thing that I've experienced is that the task is being scheduled
with Celery, but the feedback isn't properly stored. The result should be
pushed back to Postgres, but this wasn't working properly due to config
mismatches. This caused Airflow to kick off the job and dispatch it to
Celery, but the result of the job would never come back; therefore it would
exhaust the number of slots quickly. All the Celery setups that I'm running
are on 4.0 and appear to be running smoothly.

Cheers, Fokko
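The slot-exhaustion mechanism Fokko describes can be illustrated with a toy simulation (this is not Airflow or Celery code; it is a sketch of the feedback loop): a dispatched task takes a pool slot, and the slot is only released when the result comes back. If results are silently dropped, a handful of tasks drain the pool and everything after them starves.

```python
# Toy model of the failure mode: slots are taken on dispatch and freed
# only when the result backend reports back.
def run_pool(total_slots, tasks, result_returned):
    free = total_slots
    dropped = 0
    for task in tasks:
        if free == 0:
            dropped += 1          # pool exhausted; nothing can be scheduled
            continue
        free -= 1                 # slot taken when the task is dispatched
        if result_returned(task):
            free += 1             # feedback stored -> slot released
    return free, dropped

# With a healthy result backend every slot is recycled...
assert run_pool(4, range(100), lambda t: True) == (4, 0)
# ...but if feedback never arrives, 4 tasks exhaust the pool and 96 starve.
assert run_pool(4, range(100), lambda t: False) == (0, 96)
```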

2018-03-07 15:29 GMT+01:00 Kamenik, John <jk...@fourv.com>:


Re: Airflow loses track of DAG Tasks

Posted by lr...@intelimetrica.com, lr...@intelimetrica.com.
In my case the DAG goes into a failed state while a task is still in a running state.

On 2018/03/07 14:29:31, "Kamenik, John" <jk...@fourv.com> wrote: 

Re: Airflow loses track of DAG Tasks

Posted by "Kamenik, John" <jk...@fourv.com>.
Nothing specific that I can see.


I saw online that the docker image at https://github.com/puckel/docker-airflow pins Celery to 4.0.2.


With the upgrade to Airflow 1.9 we upgraded all packages, including Celery, which went to 4.1.  I have downgraded to Celery 4.0.2 and things appear to be more stable.


With Celery 4.0.2 in place I have run 30 copies of the failing DAGs 30 times each, all in parallel, and there were 2 failures total.  So I wouldn't say the issue is fixed completely, but it is certainly less bad than with Celery 4.1: down from a 33% failure rate to a 0.22% failure rate.




- John K
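The rates John quotes are consistent with his test setup, spelled out: 30 copies of the failing DAGs run 30 times each gives 900 runs, and 2 failures out of 900 is roughly 0.22%.

```python
# Reproduce the failure-rate arithmetic from the message above.
runs = 30 * 30          # 30 DAG copies x 30 runs each = 900 runs
failures = 2
rate = 100 * failures / runs
print(f"{rate:.2f}%")   # 0.22%
```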

________________________________
From: Maxime Beauchemin <ma...@gmail.com>
Sent: Wednesday, March 7, 2018 1:20:05 AM
To: dev@airflow.incubator.apache.org
Subject: Re: Airflow loses track of DAG Tasks


Re: Airflow loses track of DAG Tasks

Posted by Maxime Beauchemin <ma...@gmail.com>.
Anything else specific? I've heard SubDags can have issues under certain
conditions; is this happening inside SubDags? Has anybody else in the
community experienced anything like this on 1.9?

On Mon, Mar 5, 2018 at 9:13 AM, Kamenik, John <jk...@fourv.com> wrote:


Re: Airflow loses track of DAG Tasks

Posted by "Kamenik, John" <jk...@fourv.com>.
The Airflow scheduler, flower, and webserver are within one StatefulSet; workers are in another StatefulSet.  DAGs are shared between the scheduler and workers via NFS; we use the CeleryExecutor.


The issue happens quite often.  There are hundreds of DAGs that run every day.  Every DAG has failed at least once; most fail at least once every 3 days.  On average about 1/3 to 1/2 of all DAG runs fail in a given day.  There is no pattern of failure that we can see, other than it looks like Celery loses track of the task or the task details in the database get corrupted.


There is no obvious error in the output of any of the services: postgres, redis, scheduler, flower, or workers.  If we do find the worker logs (sometimes we cannot), they usually indicate that the script called runs to completion and is a success.


Not sure where the issue might be.




- John K
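Because the hostname is erased, finding the right worker's logs means searching them all. A hedged sketch of that manual search (the directory layout is an assumption based on Airflow 1.9's default of one log file per task try under `<base_log_folder>/<dag_id>/<task_id>/<execution_date>/`; with logs on shared NFS, as in this setup, one pass can cover every worker):

```python
import os

# Walk a shared log directory and report every file mentioning the needle
# (e.g. a dag_id.task_id pair), since the hostname column no longer says
# which worker ran the task.
def find_task_logs(base_log_folder, needle):
    hits = []
    for root, _dirs, files in os.walk(base_log_folder):
        for name in files:
            path = os.path.join(root, name)
            with open(path, errors="replace") as fh:  # tolerate odd bytes
                if needle in fh.read():
                    hits.append(path)
    return hits
```

For example, `find_task_logs("/usr/local/airflow/logs", "etl.load")` (path hypothetical) would list candidate log files to inspect by hand.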

________________________________
From: Maxime Beauchemin <ma...@gmail.com>
Sent: Monday, March 5, 2018 11:57:16 AM
To: dev@airflow.incubator.apache.org
Subject: Re: Airflow loses track of DAG Tasks


Re: Airflow loses track of DAG Tasks

Posted by Maxime Beauchemin <ma...@gmail.com>.
Are you using the Kubernetes executor, or running Airflow worker(s) inside
a persistent pod?

How often does that happen? Does it randomly occur on any task, any pattern
there?

Max

On Fri, Mar 2, 2018 at 7:09 AM, Kamenik, John <jk...@fourv.com> wrote:
