Posted to dev@airflow.apache.org by lr...@intelimetrica.com on 2018/04/10 00:35:17 UTC

Re: Airflow loses track of Dag Tasks

In my case, the DAG goes into a failed state while a task is still in a running state.
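
A minimal sketch of one way to look for this symptom directly in the metadata database, assuming Airflow 1.9's models and a reachable metadata DB (the script is illustrative, not from this thread):

    # find_failed_runs_with_running_tasks.py -- failed DagRuns that still
    # have task instances marked RUNNING.
    from airflow import settings
    from airflow.models import DagRun, TaskInstance
    from airflow.utils.state import State

    session = settings.Session()
    for dr in session.query(DagRun).filter(DagRun.state == State.FAILED):
        running = (session.query(TaskInstance)
                   .filter(TaskInstance.dag_id == dr.dag_id,
                           TaskInstance.execution_date == dr.execution_date,
                           TaskInstance.state == State.RUNNING)
                   .count())
        if running:
            print(dr.dag_id, dr.execution_date, "failed with", running, "running task(s)")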

On 2018/03/07 14:29:31, "Kamenik, John" <jk...@fourv.com> wrote: 
> Nothing specific that I can see.
> 
> 
> I saw online that the Docker image pins Celery to 4.0.2: https://github.com/puckel/docker-airflow.
> 
> 
> With the upgrade to Airflow 1.9 we upgraded all packages, including Celery, which went to 4.1.  I have since downgraded to Celery 4.0.2 and things appear to be more stable.
> 
> 
> With Celery 4.0.2 in place I have run 30 copies of the failing DAGs 30 times each, all in parallel, with 2 failures total.  So I wouldn't say the issue is fixed completely, but it is certainly less bad than with Celery 4.1: down from a 33% failure rate to a 0.22% failure rate (2 failures in 900 runs).
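> 
> As a guard against drifting off the pinned versions again, a minimal sketch of a check that could run on each scheduler and worker node (the pins are the ones from this thread; the script itself is illustrative):
> 
>     # check_pins.py -- fail fast if a node drifted off the known-good versions.
>     import airflow
>     import celery
> 
>     EXPECTED = {"airflow": "1.9.0", "celery": "4.0.2"}
>     FOUND = {"airflow": airflow.__version__, "celery": celery.__version__}
> 
>     for pkg, want in EXPECTED.items():
>         assert FOUND[pkg] == want, "%s is %s, expected %s" % (pkg, FOUND[pkg], want)
>     print("pins OK: %s" % FOUND)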
> 
> 
> 
> 
> - John K
> 
> ________________________________
> From: Maxime Beauchemin <ma...@gmail.com>
> Sent: Wednesday, March 7, 2018 1:20:05 AM
> To: dev@airflow.incubator.apache.org
> Subject: Re: Airflow loses track of Dag Tasks
> 
> Anything else specific? I heard SubDags can have issues under certain
> conditions; is this happening inside SubDags? Has anybody else in the
> community experienced anything like this on 1.9?
> 
> On Mon, Mar 5, 2018 at 9:13 AM, Kamenik, John <jk...@fourv.com> wrote:
> 
> > The Airflow scheduler, flower, and webserver are within one StatefulSet.
> > Workers are in another StatefulSet.  DAGs are shared between the scheduler
> > and workers via NFS; we use the CeleryExecutor.
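> >
> > For that layout, a minimal sketch of a sanity check worth running on every node, to confirm the scheduler and workers agree on the executor and the shared DAGs folder (the config keys are standard Airflow 1.9; the expected values are this cluster's):
> >
> >     # check_cfg.py -- every node should print the same two values.
> >     from airflow import configuration as conf
> >
> >     print("executor:    %s" % conf.get("core", "executor"))     # expect CeleryExecutor
> >     print("dags_folder: %s" % conf.get("core", "dags_folder"))  # expect the NFS mount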
> >
> >
> > The issue happens quite often.  There are hundreds of DAGs that run every
> > day.  Every DAG has failed at least once; most fail at least once every 3
> > days.  On average about 1/3 to 1/2 of all DAG runs fail in a given day.  No
> > pattern of failure that we can see, other than that it looks like Celery
> > loses track of the task or the task details in the database get corrupted.
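> >
> > One way to see whether Celery has lost track of a task is to ask the workers directly what they believe they are running and compare that against Airflow's view.  A minimal sketch (the broker URL is a placeholder; use the one from airflow.cfg):
> >
> >     # inspect_workers.py -- snapshot of what the Celery workers hold.
> >     from celery import Celery
> >
> >     app = Celery(broker="redis://redis:6379/0")  # placeholder broker URL
> >     insp = app.control.inspect()
> >     print("active:  ", insp.active())    # tasks currently executing
> >     print("reserved:", insp.reserved())  # tasks received but not yet started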
> >
> >
> > There is no obvious error in the output of any of the services: postgres,
> > redis, scheduler, flower, or workers.  When we do find the worker logs
> > (sometimes we cannot), they usually indicate that the script ran to
> > completion and succeeded.
> >
> >
> > Not sure where the issue might be.
> >
> >
> >
> >
> > - John K
> >
> > ________________________________
> > From: Maxime Beauchemin <ma...@gmail.com>
> > Sent: Monday, March 5, 2018 11:57:16 AM
> > To: dev@airflow.incubator.apache.org
> > Subject: Re: Airflow loses track of Dag Tasks
> >
> > Are you using the Kubernetes executor, or running Airflow worker(s) inside a
> > persistent pod?
> >
> > How often does that happen? Does it occur randomly on any task, or is there
> > any pattern?
> >
> > Max
> >
> > On Fri, Mar 2, 2018 at 7:09 AM, Kamenik, John <jk...@fourv.com> wrote:
> >
> > > I have an Airflow 1.9 cluster set up on Kubernetes, and I have an issue
> > > where a random DAG task shows as failed because it appears that Airflow
> > > has lost track of it.  The cluster consists of a database, a redis store,
> > > a scheduler, and 14 workers.
> > >
> > >
> > > What happens is the task starts as normal, runs, and exits, but instead
> > > of the status being written, the Operator, Start Date, Job ID, and
> > > Hostname are erased.  Shortly thereafter an end time is added and the
> > > state is set to failed.
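> > >
> > > A minimal sketch of a query for rows matching that pattern, i.e. failed task instances whose hostname was wiped (whether the columns end up NULL or empty strings may vary, so the filter may need adjusting):
> > >
> > >     # find_wiped_tis.py -- failed task instances with no hostname recorded.
> > >     from airflow import settings
> > >     from airflow.models import TaskInstance
> > >     from airflow.utils.state import State
> > >
> > >     session = settings.Session()
> > >     suspects = (session.query(TaskInstance)
> > >                 .filter(TaskInstance.state == State.FAILED)
> > >                 .filter((TaskInstance.hostname == '') |
> > >                         (TaskInstance.hostname.is_(None)))
> > >                 .all())
> > >     for ti in suspects:
> > >         print(ti.dag_id, ti.task_id, ti.execution_date, ti.end_date)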
> > >
> > >
> > > Given the hostname is erased, I have to brute-force a search for the logs
> > > of the worker that executed the task.  If I can find the task logs, they
> > > indicate the command (BashOperator) ran to completion and exited cleanly.
> > > I don't see any errors in the Airflow scheduler or any workers that would
> > > indicate any issues.  I am not sure what else to debug.
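> > >
> > > A minimal sketch of that brute-force search, assuming the default 1.9 log layout of <base_log_folder>/<dag_id>/<task_id>/<execution_date>/ (the base path and IDs below are placeholders):
> > >
> > >     # find_task_logs.py -- locate a task's log files when the hostname is gone.
> > >     import os
> > >
> > >     BASE = "/usr/local/airflow/logs"       # placeholder log root
> > >     dag_id, task_id = "my_dag", "my_task"  # placeholder IDs
> > >     needle = os.path.join(dag_id, task_id)
> > >
> > >     for root, dirs, files in os.walk(BASE):
> > >         if needle in root:
> > >             for name in files:
> > >                 print(os.path.join(root, name))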
> > >
> > >
> > >
> > >
> > > - John K
> > >
> >
>