Posted to dev@airflow.apache.org by George Leslie-Waksman <ge...@cloverhealth.com.INVALID> on 2017/08/04 17:24:41 UTC

Re: Completed tasks not being marked as completed

Pretty sure (not 100%) what is happening is (a sketch of the resulting lost update follows the list):

   1. Scheduler bugs result in task getting scheduled twice
   2. Worker 1 grabs task
   3. Worker 2 grabs task
   4. Worker 1 starts task
   5. Worker 2 starts task
   6. Worker 2 sees that Worker 1 has started and plans to abort
   7. Worker 1 finishes and marks task as done
   8. Worker 2 finishes aborting and marks task as not done

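A minimal sketch of that lost update, assuming each worker does a plain
read-then-write of the task state (table and column names are illustrative,
not Airflow's exact schema):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE task_instance (id INTEGER PRIMARY KEY, state TEXT)")
    db.execute("INSERT INTO task_instance VALUES (1, 'QUEUED')")

    def worker_1_finishes():
        # Step 7: Worker 1 completes the task and records success.
        db.execute("UPDATE task_instance SET state = 'SUCCESS' WHERE id = 1")

    def worker_2_aborts():
        # Step 8: Worker 2 decided to abort earlier but writes its verdict
        # last; nothing re-checks the current state in the same transaction
        # as the write, so the SUCCESS is silently overwritten.
        db.execute("UPDATE task_instance SET state = 'NONE' WHERE id = 1")

    worker_1_finishes()
    worker_2_aborts()
    # Prints ('NONE',): the task ran to completion but no longer looks done.
    print(db.execute("SELECT state FROM task_instance WHERE id = 1").fetchone())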

On Fri, Jul 28, 2017 at 3:50 PM Marc Weil <mw...@newrelic.com> wrote:

> Hey Max,
>
> Thanks for the suggestions. I believe it was a retry (I'm using remote
> logging so I can only check after the task completes), but the UI never
> reported it as such. The latest_heartbeat column is in the jobs table, and
> interestingly I do see some running jobs that haven't heartbeated for ~22
> minutes. They are LocalTaskJob instances with CeleryExecutor properly
> listed as the executor class. I can't really correlate these to a specific
> task instance, however, as there doesn't appear to be any key written to
> the jobs table (the dag_id column is all null, and there's no task_id
> column or anything).
>
> Any ideas on what could be making these tasks stop heartbeating regularly?
> That could explain why eventually (after an overnight period of time)
> everything is marked as finished in the Airflow UI: eventually these tasks
> do heartbeat again, but quite long after they are finished running.
>
> Thanks again!
>
> --
> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
> New Relic
>
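A query along these lines should surface the stalled jobs described above
(a sketch: assumes a Postgres metadata DB, SQLAlchemy, and the job table and
latest_heartbeat column named as in the description; the URL is a placeholder
for your sql_alchemy_conn):

    from sqlalchemy import create_engine

    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")
    rows = engine.execute("""
        SELECT id, job_type, state, latest_heartbeat
        FROM job
        WHERE state = 'running'
          AND latest_heartbeat < now() - interval '5 minutes'
        ORDER BY latest_heartbeat
    """).fetchall()
    for row in rows:
        print(row)  # LocalTaskJob rows that stopped heartbeating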
> On Fri, Jul 28, 2017 at 3:15 PM, Maxime Beauchemin <
> maximebeauchemin@gmail.com> wrote:
>
> > Are you sure there hasn't been a retry at that point? [One of] the
> expected
> > behavior is the one I described, where if a task finished without
> reporting
> its success [or failure], it will stay marked as RUNNING, but will fail
> to
> > emit a heartbeat (which is a timestamp updated in the task_instance table
> > as last_heartbeat or something).  The scheduler monitors for RUNNING
> tasks
> > without heartbeat and eventually will handle the failure (send emails,
> call
> > on_failure_callback, ...).
> >
> > Looking for heartbeat in the DB might give you some clues as to what is
> > going on.
> >
> > Also there have been versions where we'd occasionally see double
> > triggering, and double firing, which can be confusing. Then you can have
> > different processes reporting their status and debugging those issues can
> > be problematic. I think there's good prevention against that now, using
> > database transactions as the task instance sets itself as RUNNING. I'm
> not
> > sure if 1.8.0 is 100% clean in that regard.
> >
> > Max
> >
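For reference, the failure handling described above is configured per task;
a minimal sketch against the 1.8-era operator API (the surrounding DAG object
is assumed to exist, and the names are placeholders):

    from airflow.operators.bash_operator import BashOperator

    def notify(context):
        # Called when the failure is handled; context carries the task
        # instance, execution date, etc.
        print("task failed:", context["task_instance"])

    task = BashOperator(
        task_id="example_task",
        bash_command="exit 0",
        email_on_failure=True,       # send emails on failure
        on_failure_callback=notify,  # custom hook, as mentioned above
        dag=dag,                     # assumes `dag` is defined elsewhere
    )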
> > On Fri, Jul 28, 2017 at 10:01 AM, Marc Weil <mw...@newrelic.com> wrote:
> >
> > > It happens mostly when the scheduler is catching up. More specifically,
> > > when I load a brand new DAG with a start date in the past. Usually I
> have
> > > it set to run 5 DAG runs at the same time, and up to 16 tasks at the
> same
> > > time.
> > >
> > > What I've also noticed is that the tasks will sit completed in reality
> > but
> > > uncompleted in the Airflow DB for many hours, but if I just leave them
> > all
> > > sitting there over night they all tend to be marked complete the next
> > > morning. Perhaps this points to some sort of Celery timeout or
> connection
> > > retry interval?
> > >
> > > --
> > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and
> Engagement
> > |
> > > New Relic
> > >
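The catch-up setup described above maps to DAG arguments roughly like this
(a sketch; the dag_id, dates, and schedule are placeholders):

    from datetime import datetime
    from airflow import DAG

    dag = DAG(
        dag_id="backfill_heavy_dag",
        start_date=datetime(2017, 1, 1),  # start date in the past => catch-up
        schedule_interval="@daily",
        max_active_runs=5,  # up to 5 DAG runs at the same time
        concurrency=16,     # up to 16 of this DAG's tasks at the same time
    )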
> > > On Fri, Jul 28, 2017 at 9:58 AM, Maxime Beauchemin <
> > > maximebeauchemin@gmail.com> wrote:
> > >
> > > > By the time "INFO - Task exited with return code 0" gets logged, the
> > task
> > > > should have been marked as successful by the subprocess. I have no
> > > specific
> > > > intuition as to what the issue may be.
> > > >
> > > > I'm guessing at that point the job stops emitting heartbeat and
> > > eventually
> > > > the scheduler will handle it as a failure?
> > > >
> > > > How often does that happen?
> > > >
> > > > Max
> > > >
> > > > On Fri, Jul 28, 2017 at 9:43 AM, Marc Weil <mw...@newrelic.com>
> wrote:
> > > >
> > > > > From what I can tell, it only affects CeleryExecutor. I've never
> seen
> > > > this
> > > > > behavior with LocalExecutor before.
> > > > >
> > > > > Max, do you know anything about this type of failure mode?
> > > > >
> > > > > --
> > > > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and
> > > Engagement
> > > > |
> > > > > New Relic
> > > > >
> > > > > On Fri, Jul 28, 2017 at 5:48 AM, Jonas Karlsson <
> thejonas@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > We have the exact same problem. In our case, it's a bash operator
> > > > > starting
> > > > > > a docker container. The container and process it ran exit, but
> the
> > > > > 'docker
> > > > > > run' command is still showing up in the process table, waiting
> for
> > an
> > > > > > event.
> > > > > > I'm trying to switch to LocalExecutor to see if that will help.
> > > > > >
> > > > > > _jonas
> > > > > >
> > > > > >
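One way to confirm that symptom, assuming a Unix ps(1): look for `docker run`
client processes whose elapsed time keeps growing after `docker ps` shows the
container gone:

    import subprocess

    # Process-table lines mentioning `docker run`; a long etime here with
    # no matching container is the hang described above.
    ps = subprocess.check_output(["ps", "-eo", "pid,etime,args"]).decode()
    for line in ps.splitlines():
        if "docker run" in line:
            print(line)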
> > > > > > On Thu, Jul 27, 2017 at 4:28 PM Marc Weil <mw...@newrelic.com>
> > > wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > Has anyone seen the behavior when using CeleryExecutor where
> > > workers
> > > > > will
> > > > > > > finish their tasks ("INFO - Task exited with return code 0"
> shows
> > > in
> > > > > the
> > > > > > > logs) but are never marked as complete in the airflow DB or UI?
> > > > > > Effectively
> > > > > > > this causes tasks to hang even though they are complete, and
> the
> > > DAG
> > > > > will
> > > > > > > not continue.
> > > > > > >
> > > > > > > This is happening on 1.8.0. Anyone else seen this or perhaps
> > have a
> > > > > > > workaround?
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > --
> > > > > > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and
> > > > > Engagement
> > > > > > |
> > > > > > > New Relic
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Completed tasks not being marked as completed

Posted by Bolke de Bruin <bd...@gmail.com>.
Can you check 1.8.1? 1.8.0 was a big release after a year, so some things slipped through. It would be good to know if this persists on 1.8.1.

Bolke

Sent from my iPhone

> On 4 Aug 2017, at 21:02, Marc Weil <mw...@newrelic.com> wrote:
> 
> This is on version 1.8.0. I don't recall seeing this behavior when I was
> running 1.7.x.
> -- 
> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
> New Relic

Re: Completed tasks not being marked as completed

Posted by Arthur Wiedmer <ar...@gmail.com>.
Sorry Marc, must be tired; I responded to the wrong thread.

I was thinking of Airflow CPU usage in the browser.

Apologies.

Best,
Arthur

On Fri, Aug 4, 2017 at 2:26 PM, Marc Weil <mw...@newrelic.com> wrote:

> This particular behavior I was reporting isn't related to the browser. The
> tasks remain in the incorrect state in the database as well.
> --
> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
> New Relic
>

Re: Completed tasks not being marked as completed

Posted by Marc Weil <mw...@newrelic.com>.
This particular behavior I was reporting isn't related to the browser. The
tasks remain in the incorrect state in the database as well.
On Fri, Aug 4, 2017 at 5:15 PM Arthur Wiedmer <ar...@gmail.com>
wrote:

> I have noticed the issue more prominently with Chrome after v54, and I was
> wondering if this was related at all. Somehow, Airflow seems kinder on
> Firefox.
>
> Does anyone else have a similar experience?
>
> Best,
> Arthur
-- 
Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
New Relic

Re: Completed tasks not being marked as completed

Posted by Arthur Wiedmer <ar...@gmail.com>.
I have noticed the issue more prominently with Chrome after v54, and I was
wondering if this was related at all. Somehow, Airflow seems kinder on
Firefox.

Does anyone else have a similar experience?

Best,
Arthur

On Fri, Aug 4, 2017 at 12:02 PM, Marc Weil <mw...@newrelic.com> wrote:

> This is on version 1.8.0. I don't recall seeing this behavior when I was
> running 1.7.x.
> --
> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
> New Relic
>

Re: Completed tasks not being marked as completed

Posted by Marc Weil <mw...@newrelic.com>.
This is on version 1.8.0. I don't recall seeing this behavior when I was
running 1.7.x.
On Fri, Aug 4, 2017 at 2:58 PM Bolke de Bruin <bd...@gmail.com> wrote:

> What version of Airflow? Off the top of my head, for 1.8.X:
>
> 1) we do use db locking
> 2) we check the state after we get the lock
> 3) I don't think the task sets a state if it finds out it is running
> somewhere else
>
> Maybe we do something at the executor/scheduler level. I need to
> investigate that if this issue shows up on a recent version.
>
> Bolke
>
> Sent from my iPhone
>
-- 
Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
New Relic

Re: Completed tasks not being marked as completed

Posted by Bolke de Bruin <bd...@gmail.com>.
What version of Airflow? Off the top of my head, for 1.8.X (a rough sketch of 1 and 2 follows the list):

1) we do use db locking
2) we check the state after we get the lock
3) I don't think the task sets a state if it finds out it is running somewhere else

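A rough sketch of the pattern in 1) and 2) (not the actual Airflow code):
take a row lock, re-check the state under the lock, and only then write.
Postgres-style SQL; the connection URL and key values are placeholders:

    from sqlalchemy import create_engine

    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")
    key = ("example_task", "example_dag", "2017-08-04 00:00:00")

    conn = engine.connect()
    trans = conn.begin()
    row = conn.execute(
        "SELECT state FROM task_instance "
        "WHERE task_id = %s AND dag_id = %s AND execution_date = %s "
        "FOR UPDATE",
        key,
    ).fetchone()
    if row is not None and row[0] == "running":
        # Point 3: another worker owns this task instance; back off
        # without writing any state over its result.
        trans.rollback()
    else:
        conn.execute(
            "UPDATE task_instance SET state = 'running' "
            "WHERE task_id = %s AND dag_id = %s AND execution_date = %s",
            key,
        )
        trans.commit()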
Maybe we do something at the executor/scheduler level. I need to investigate that if this issue shows up on a recent version.

Bolke

Sent from my iPhone

> On 4 Aug 2017, at 19:24, George Leslie-Waksman <ge...@cloverhealth.com.INVALID> wrote:
> 
> Pretty sure (not 100%) what is happening is:
> 
>   1. Scheduler bugs result in task getting scheduled twice
>   2. Worker 1 grabs task
>   3. Worker 2 grabs task
>   4. Worker 1 starts task
>   5. Worker 2 starts task
>   6. Worker 2 sees that Worker 1 has started and plans to abort
>   7. Worker 1 finishes and marks task as done
>   8. Worker 2 finishes aborting and marks task as not done
> 