Posted to dev@airflow.apache.org by Marc Weil <mw...@newrelic.com> on 2017/07/27 20:28:07 UTC

Completed tasks not being marked as completed

Hello,

Has anyone seen behavior when using CeleryExecutor where workers finish
their tasks ("INFO - Task exited with return code 0" shows in the
logs) but the tasks are never marked as complete in the Airflow DB or UI?
Effectively this causes tasks to hang even though they are complete, and the
DAG will not continue.
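
To confirm what the metadata database itself thinks, a check along these
lines will list the task instances still marked as running (a minimal
sketch: the connection URL is a placeholder, and it assumes SQLAlchemy and
direct read access to the Airflow metadata DB):

    # Sketch: list task instances the metadata DB still considers 'running'.
    # The connection URL below is a placeholder for your own metadata DB.
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")

    query = text("""
        SELECT dag_id, task_id, execution_date, state, start_date
        FROM task_instance
        WHERE state = 'running'
        ORDER BY start_date
    """)

    with engine.connect() as conn:
        for row in conn.execute(query):
            print(row)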

This is happening on 1.8.0. Anyone else seen this or perhaps have a
workaround?

Thanks!

--
Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
New Relic

Re: Completed tasks not being marked as completed

Posted by Bolke de Bruin <bd...@gmail.com>.
Can you check 1.8.1? 1.8.0 was a big release after a year, so some things slipped through. It would be good to know if this persists on 1.8.1.

Bolke

Sent from my iPhone

> On 4 Aug 2017, at 21:02, Marc Weil <mw...@newrelic.com> wrote:
> 
> This is on version 1.8.0. I don't recall seeing this behavior when I was
> running 1.7.x.
>> On Fri, Aug 4, 2017 at 2:58 PM Bolke de Bruin <bd...@gmail.com> wrote:
>> 
>> What version of airflow? From the top of my mind 1.8.X
>> 
>> 1) we do use db locking
>> 2) we check the state after we get the lock
>> 3) I don't think the task sets a state if it finds out it is running
>> somewhere else
>> 
>> Maybe we do something at the executor/scheduler level. That I need to
>> investigate if this issue is on a recent version.
>> 
>> Bolke
>> 
>> Sent from my iPhone
>> 
>>> On 4 Aug 2017, at 19:24, George Leslie-Waksman <ge...@cloverhealth.com.INVALID>
>> wrote:
>>> 
>>> Pretty sure (not 100%) what is happening is:
>>> 
>>>  1. Scheduler bugs result in task getting scheduled twice
>>>  2. Worker 1 grabs task
>>>  3. Worker 2 grabs task
>>>  4. Worker 1 starts task
>>>  5. Worker 2 starts task
>>>  6. Worker 2 sees that Worker 1 has started and plans to abort
>>>  7. Worker 1 finishes and marks task as done
>>>  8. Worker 2 finishes aborting and marks task as not done
>>> 
>>> 
>>>> On Fri, Jul 28, 2017 at 3:50 PM Marc Weil <mw...@newrelic.com> wrote:
>>>> 
>>>> Hey Max,
>>>> 
>>>> Thanks for the suggestions. I believe it was a retry (I'm using remote
>>>> logging so I can only check after the task completes), but the UI never
>>>> reported it as such. The latest_heartbeat column is in the jobs table,
>> and
>>>> interestingly I do see some running jobs that haven't heartbeated for
>> ~22
>>>> minutes. They are LocalTaskJob instances with CeleryExecutor properly
>>>> listed as the executory class. I can't really correlate these to a
>> specific
>>>> task instance, however, as there doesn't appear to be any key written to
>>>> the jobs table (the dag_id column is all null, and there's no task_id
>>>> column or anything).
>>>> 
>>>> Any ideas on what could be making these tasks stop heartbeating
>> regularly?
>>>> That could explain why eventually (after an overnight period of time)
>>>> everything is marked as finished in the Airflow UI: eventually these
>> tasks
>>>> do heartbeat again, but quite long after they are finished running.
>>>> 
>>>> Thanks again!
>>>> 
>>>> --
>>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and
>> Engagement |
>>>> New Relic
>>>> 
>>>> On Fri, Jul 28, 2017 at 3:15 PM, Maxime Beauchemin <
>>>> maximebeauchemin@gmail.com> wrote:
>>>> 
>>>>> Are you sure there hasn't been a retry at that point? [One of] the
>>>> expected
>>>>> behavior is the one I described, where if a task finished without
>>>> reporting
>>>>> it's success [or failure], it will stay marked as RUNNING, but will
>> fail
>>>> to
>>>>> emit a heartbeat (which is a timestamp updated in the task_instance
>> table
>>>>> as last_heartbeat or something).  The scheduler monitors for RUNNING
>>>> tasks
>>>>> without heartbeat and eventually will handle the failure (send emails,
>>>> call
>>>>> on_failure_callback, ...).
>>>>> 
>>>>> Looking for heartbeat in the DB might give you some clues as to what is
>>>>> going on.
>>>>> 
>>>>> Also there have been versions where we'd occasionally see double
>>>>> triggering, and double firing, which can be confusing. Then you can
>> have
>>>>> different processes reporting their status and debugging those issues
>> can
>>>>> be problematic. I think there's good prevention against that now, using
>>>>> database transactions as the task instance sets itself as RUNNING. I'm
>>>> not
>>>>> sure if 1.8.0 is 100% clean from that regard.
>>>>> 
>>>>> Max
>>>>> 
>>>>>> On Fri, Jul 28, 2017 at 10:01 AM, Marc Weil <mw...@newrelic.com>
>> wrote:
>>>>>> 
>>>>>> It happens mostly when the scheduler is catching up. More
>> specifically,
>>>>>> when I load a brand new DAG with a start date in the past. Usually I
>>>> have
>>>>>> it set to run 5 DAG runs at the same time, and up to 16 tasks at the
>>>> same
>>>>>> time.
>>>>>> 
>>>>>> What I've also noticed is that the tasks will sit completed in reality
>>>>> but
>>>>>> uncompleted in the Airflow DB for many hours, but if I just leave them
>>>>> all
>>>>>> sitting there over night they all tend to be marked complete the next
>>>>>> morning. Perhaps this points to some sort of Celery timeout or
>>>> connection
>>>>>> retry interval?
>>>>>> 
>>>>>> --
>>>>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and
>>>> Engagement
>>>>> |
>>>>>> New Relic
>>>>>> 
>>>>>> On Fri, Jul 28, 2017 at 9:58 AM, Maxime Beauchemin <
>>>>>> maximebeauchemin@gmail.com> wrote:
>>>>>> 
>>>>>>> By the time "INFO - Task exited with return code 0" gets logged, the
>>>>> task
>>>>>>> should have been marked as successful by the subprocess. I have no
>>>>>> specific
>>>>>>> intuition as to what the issue may be.
>>>>>>> 
>>>>>>> I'm guessing at that point the job stops emitting heartbeat and
>>>>>> eventually
>>>>>>> the scheduler will handle it as a failure?
>>>>>>> 
>>>>>>> How often does that happen?
>>>>>>> 
>>>>>>> Max
>>>>>>> 
>>>>>>> On Fri, Jul 28, 2017 at 9:43 AM, Marc Weil <mw...@newrelic.com>
>>>> wrote:
>>>>>>> 
>>>>>>>> From what I can tell, it only affects CeleryExecutor. I've never
>>>> seen
>>>>>>> this
>>>>>>>> behavior with LocalExecutor before.
>>>>>>>> 
>>>>>>>> Max, do you know anything about this type of failure mode?
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and
>>>>>> Engagement
>>>>>>> |
>>>>>>>> New Relic
>>>>>>>> 
>>>>>>>> On Fri, Jul 28, 2017 at 5:48 AM, Jonas Karlsson <
>>>> thejonas@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> We have the exact same problem. In our case, it's a bash operator
>>>>>>>> starting
>>>>>>>>> a docker container. The container and process it ran exit, but
>>>> the
>>>>>>>> 'docker
>>>>>>>>> run' command is still showing up in the process table, waiting
>>>> for
>>>>> an
>>>>>>>>> event.
>>>>>>>>> I'm trying to switch to LocalExecutor to see if that will help.
>>>>>>>>> 
>>>>>>>>> _jonas
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, Jul 27, 2017 at 4:28 PM Marc Weil <mw...@newrelic.com>
>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hello,
>>>>>>>>>> 
>>>>>>>>>> Has anyone seen the behavior when using CeleryExecutor where
>>>>>> workers
>>>>>>>> will
>>>>>>>>>> finish their tasks ("INFO - Task exited with return code 0"
>>>> shows
>>>>>> in
>>>>>>>> the
>>>>>>>>>> logs) but are never marked as complete in the airflow DB or UI?
>>>>>>>>> Effectively
>>>>>>>>>> this causes tasks to hang even though they are complete, and
>>>> the
>>>>>> DAG
>>>>>>>> will
>>>>>>>>>> not continue.
>>>>>>>>>> 
>>>>>>>>>> This is happening on 1.8.0. Anyone else seen this or perhaps
>>>>> have a
>>>>>>>>>> workaround?
>>>>>>>>>> 
>>>>>>>>>> Thanks!
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Marc Weil | Lead Engineer | Growth Automation, Marketing, and
>>>>>>>> Engagement
>>>>>>>>> |
>>>>>>>>>> New Relic
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>> 
> -- 
> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
> New Relic

Re: Completed tasks not being marked as completed

Posted by Arthur Wiedmer <ar...@gmail.com>.
Sorry Marc, must be tired, I responded to the wrong thread.

I was thinking of Airflow CPU usage in the browser.

Apologies.

Best,
Arthur

Re: Completed tasks not being marked as completed

Posted by Marc Weil <mw...@newrelic.com>.
This particular behavior I was reporting isn't related to the browser. The
tasks remain in the incorrect state in the database as well.
-- 
Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
New Relic

Re: Completed tasks not being marked as completed

Posted by Arthur Wiedmer <ar...@gmail.com>.
I have noticed the issue more prominently with Chrome after v54, and I was
wondering if this was related at all. Somehow, Airflow seems kinder on
Firefox.

Does anyone else have a similar experience?

Best,
Arthur

Re: Completed tasks not being marked as completed

Posted by Marc Weil <mw...@newrelic.com>.
This is on version 1.8.0. I don't recall seeing this behavior when I was
running 1.7.x.
-- 
Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
New Relic

Re: Completed tasks not being marked as completed

Posted by Bolke de Bruin <bd...@gmail.com>.
What version of Airflow? Off the top of my head, in 1.8.x:

1) we do use db locking
2) we check the state after we get the lock
3) I don't think the task sets a state if it finds out it is running somewhere else

Maybe we do something at the executor/scheduler level. I would need to investigate that if this issue is still present on a recent version.

Bolke

Sent from my iPhone

Re: Completed tasks not being marked as completed

Posted by George Leslie-Waksman <ge...@cloverhealth.com.INVALID>.
Pretty sure (not 100%) what is happening is:

   1. Scheduler bugs result in task getting scheduled twice
   2. Worker 1 grabs task
   3. Worker 2 grabs task
   4. Worker 1 starts task
   5. Worker 2 starts task
   6. Worker 2 sees that Worker 1 has started and plans to abort
   7. Worker 1 finishes and marks task as done
   8. Worker 2 finishes aborting and marks task as not done
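
The guard Bolke mentions above (take a lock, then re-check the state before
writing) is what would keep step 8 from clobbering step 7. A rough sketch of
that pattern against the metadata DB, illustrative only and not Airflow's
actual code, with a placeholder connection URL:

    # Re-read the task instance's state under a row lock before overwriting
    # it, so a late-aborting worker cannot replace a SUCCESS that another
    # worker already recorded.
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")

    def set_state_if_still_running(dag_id, task_id, execution_date, new_state):
        with engine.begin() as conn:
            row = conn.execute(text("""
                SELECT state FROM task_instance
                WHERE dag_id = :dag_id AND task_id = :task_id
                  AND execution_date = :execution_date
                FOR UPDATE
            """), dict(dag_id=dag_id, task_id=task_id,
                       execution_date=execution_date)).fetchone()
            # Only write if the row is still 'running' once we hold the lock.
            if row is None or row[0] != "running":
                return False
            conn.execute(text("""
                UPDATE task_instance SET state = :new_state
                WHERE dag_id = :dag_id AND task_id = :task_id
                  AND execution_date = :execution_date
            """), dict(new_state=new_state, dag_id=dag_id, task_id=task_id,
                       execution_date=execution_date))
            return True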


Re: Completed tasks not being marked as completed

Posted by Marc Weil <mw...@newrelic.com>.
Hey Max,

Thanks for the suggestions. I believe it was a retry (I'm using remote
logging so I can only check after the task completes), but the UI never
reported it as such. The latest_heartbeat column is in the jobs table, and
interestingly I do see some running jobs that haven't heartbeated for ~22
minutes. They are LocalTaskJob instances with CeleryExecutor properly
listed as the executor class. I can't really correlate these to a specific
task instance, however, as there doesn't appear to be any key written to
the jobs table (the dag_id column is all null, and there's no task_id
column or anything).
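
A query along these lines will surface them (placeholder connection URL; the
table is named job in the metadata DB, and the interval syntax assumes
Postgres):

    # Sketch: running LocalTaskJob rows whose latest_heartbeat is stale.
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://airflow:airflow@localhost/airflow")

    query = text("""
        SELECT id, job_type, state, executor_class, latest_heartbeat
        FROM job
        WHERE job_type = 'LocalTaskJob'
          AND state = 'running'
          AND latest_heartbeat < now() - interval '10 minutes'
        ORDER BY latest_heartbeat
    """)

    with engine.connect() as conn:
        for row in conn.execute(query):
            print(row)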

Any ideas on what could be making these tasks stop heartbeating regularly?
That could explain why eventually (after an overnight period of time)
everything is marked as finished in the Airflow UI: eventually these tasks
do heartbeat again, but quite long after they are finished running.

Thanks again!

--
Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
New Relic

Re: Completed tasks not being marked as completed

Posted by Maxime Beauchemin <ma...@gmail.com>.
Are you sure there hasn't been a retry at that point? [One of] the expected
behaviors is the one I described, where if a task finishes without reporting
its success [or failure], it will stay marked as RUNNING, but will fail to
emit a heartbeat (which is a timestamp updated in the task_instance table
as last_heartbeat or something).  The scheduler monitors for RUNNING tasks
without heartbeat and eventually will handle the failure (send emails, call
on_failure_callback, ...).

Looking for heartbeat in the DB might give you some clues as to what is
going on.

Also there have been versions where we'd occasionally see double
triggering, and double firing, which can be confusing. Then you can have
different processes reporting their status and debugging those issues can
be problematic. I think there's good prevention against that now, using
database transactions as the task instance sets itself as RUNNING. I'm not
sure if 1.8.0 is 100% clean in that regard.
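
Roughly, the kind of guard I mean looks like this (a sketch of the pattern,
not the actual code path; in 1.8.x the real logic lives in TaskInstance and
LocalTaskJob):

    from airflow import settings
    from airflow.models import TaskInstance
    from airflow.utils.state import State

    def try_to_claim(ti):
        """Lock the task_instance row, re-check its state, then mark RUNNING."""
        session = settings.Session()
        locked = (session.query(TaskInstance)
                  .filter_by(task_id=ti.task_id,
                             dag_id=ti.dag_id,
                             execution_date=ti.execution_date)
                  .with_for_update()  # SELECT ... FOR UPDATE blocks a second worker
                  .one())
        if locked.state == State.RUNNING:
            session.commit()
            return False  # another process already claimed this task instance
        locked.state = State.RUNNING
        session.commit()
        return True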

Max

On Fri, Jul 28, 2017 at 10:01 AM, Marc Weil <mw...@newrelic.com> wrote:

> It happens mostly when the scheduler is catching up. More specifically,
> when I load a brand new DAG with a start date in the past. Usually I have
> it set to run 5 DAG runs at the same time, and up to 16 tasks at the same
> time.
>
> What I've also noticed is that the tasks will sit completed in reality but
> uncompleted in the Airflow DB for many hours, but if I just leave them all
> sitting there overnight they all tend to be marked complete the next
> morning. Perhaps this points to some sort of Celery timeout or connection
> retry interval?
> ᐧ
>
> --
> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
> New Relic
>
> On Fri, Jul 28, 2017 at 9:58 AM, Maxime Beauchemin <
> maximebeauchemin@gmail.com> wrote:
>
> > By the time "INFO - Task exited with return code 0" gets logged, the task
> > should have been marked as successful by the subprocess. I have no
> specific
> > intuition as to what the issue may be.
> >
> > I'm guessing at that point the job stops emitting heartbeat and
> eventually
> > the scheduler will handle it as a failure?
> >
> > How often does that happen?
> >
> > Max
> >
> > On Fri, Jul 28, 2017 at 9:43 AM, Marc Weil <mw...@newrelic.com> wrote:
> >
> > > From what I can tell, it only affects CeleryExecutor. I've never seen
> > this
> > > behavior with LocalExecutor before.
> > >
> > > Max, do you know anything about this type of failure mode?
> > > ᐧ
> > >
> > > --
> > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and
> Engagement
> > |
> > > New Relic
> > >
> > > On Fri, Jul 28, 2017 at 5:48 AM, Jonas Karlsson <th...@gmail.com>
> > > wrote:
> > >
> > > > We have the exact same problem. In our case, it's a bash operator
> > > starting
> > > > a docker container. The container and process it ran exit, but the
> > > 'docker
> > > > run' command is still showing up in the process table, waiting for an
> > > > event.
> > > > I'm trying to switch to LocalExecutor to see if that will help.
> > > >
> > > > _jonas
> > > >
> > > >
> > > > On Thu, Jul 27, 2017 at 4:28 PM Marc Weil <mw...@newrelic.com>
> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > Has anyone seen the behavior when using CeleryExecutor where
> workers
> > > will
> > > > > finish their tasks ("INFO - Task exited with return code 0" shows
> in
> > > the
> > > > > logs) but are never marked as complete in the airflow DB or UI?
> > > > Effectively
> > > > > this causes tasks to hang even though they are complete, and the
> DAG
> > > will
> > > > > not continue.
> > > > >
> > > > > This is happening on 1.8.0. Anyone else seen this or perhaps have a
> > > > > workaround?
> > > > >
> > > > > Thanks!
> > > > >
> > > > > --
> > > > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and
> > > Engagement
> > > > |
> > > > > New Relic
> > > > > ᐧ
> > > > >
> > > >
> > >
> >
>

Re: Completed tasks not being marked as completed

Posted by Marc Weil <mw...@newrelic.com>.
It happens mostly when the scheduler is catching up. More specifically,
when I load a brand new DAG with a start date in the past. Usually I have
it set to run 5 DAG runs at the same time, and up to 16 tasks at the same
time.
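
Concretely that means DAG-level settings along these lines (a sketch with a
placeholder dag_id; the equivalent airflow.cfg knobs are
max_active_runs_per_dag and dag_concurrency):

    from datetime import datetime
    from airflow import DAG

    dag = DAG(
        dag_id='backfill_heavy_dag',      # placeholder name
        start_date=datetime(2017, 1, 1),  # start date in the past, so it catches up
        schedule_interval='@daily',
        max_active_runs=5,                # at most 5 concurrent DAG runs
        concurrency=16,                   # at most 16 concurrent tasks in this DAG
    )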

What I've also noticed is that the tasks will sit completed in reality but
uncompleted in the Airflow DB for many hours, but if I just leave them all
sitting there overnight they all tend to be marked complete the next
morning. Perhaps this points to some sort of Celery timeout or connection
retry interval?

--
Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
New Relic

On Fri, Jul 28, 2017 at 9:58 AM, Maxime Beauchemin <
maximebeauchemin@gmail.com> wrote:

> By the time "INFO - Task exited with return code 0" gets logged, the task
> should have been marked as successful by the subprocess. I have no specific
> intuition as to what the issue may be.
>
> I'm guessing at that point the job stops emitting heartbeat and eventually
> the scheduler will handle it as a failure?
>
> How often does that happen?
>
> Max
>
> On Fri, Jul 28, 2017 at 9:43 AM, Marc Weil <mw...@newrelic.com> wrote:
>
> > From what I can tell, it only affects CeleryExecutor. I've never seen
> this
> > behavior with LocalExecutor before.
> >
> > Max, do you know anything about this type of failure mode?
> > ᐧ
> >
> > --
> > Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement
> |
> > New Relic
> >
> > On Fri, Jul 28, 2017 at 5:48 AM, Jonas Karlsson <th...@gmail.com>
> > wrote:
> >
> > > We have the exact same problem. In our case, it's a bash operator
> > starting
> > > a docker container. The container and process it ran exit, but the
> > 'docker
> > > run' command is still showing up in the process table, waiting for an
> > > event.
> > > I'm trying to switch to LocalExecutor to see if that will help.
> > >
> > > _jonas
> > >
> > >
> > > On Thu, Jul 27, 2017 at 4:28 PM Marc Weil <mw...@newrelic.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > Has anyone seen the behavior when using CeleryExecutor where workers
> > will
> > > > finish their tasks ("INFO - Task exited with return code 0" shows in
> > the
> > > > logs) but are never marked as complete in the airflow DB or UI?
> > > Effectively
> > > > this causes tasks to hang even though they are complete, and the DAG
> > will
> > > > not continue.
> > > >
> > > > This is happening on 1.8.0. Anyone else seen this or perhaps have a
> > > > workaround?
> > > >
> > > > Thanks!
> > > >
> > > > --
> > > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and
> > Engagement
> > > |
> > > > New Relic
> > > > ᐧ
> > > >
> > >
> >
>

Re: Completed tasks not being marked as completed

Posted by Maxime Beauchemin <ma...@gmail.com>.
By the time "INFO - Task exited with return code 0" gets logged, the task
should have been marked as successful by the subprocess. I have no specific
intuition as to what the issue may be.

I'm guessing at that point the job stops emitting heartbeat and eventually
the scheduler will handle it as a failure?

How often does that happen?

Max

On Fri, Jul 28, 2017 at 9:43 AM, Marc Weil <mw...@newrelic.com> wrote:

> From what I can tell, it only affects CeleryExecutor. I've never seen this
> behavior with LocalExecutor before.
>
> Max, do you know anything about this type of failure mode?
> ᐧ
>
> --
> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
> New Relic
>
> On Fri, Jul 28, 2017 at 5:48 AM, Jonas Karlsson <th...@gmail.com>
> wrote:
>
> > We have the exact same problem. In our case, it's a bash operator
> starting
> > a docker container. The container and process it ran exit, but the
> 'docker
> > run' command is still showing up in the process table, waiting for an
> > event.
> > I'm trying to switch to LocalExecutor to see if that will help.
> >
> > _jonas
> >
> >
> > On Thu, Jul 27, 2017 at 4:28 PM Marc Weil <mw...@newrelic.com> wrote:
> >
> > > Hello,
> > >
> > > Has anyone seen the behavior when using CeleryExecutor where workers
> will
> > > finish their tasks ("INFO - Task exited with return code 0" shows in
> the
> > > logs) but are never marked as complete in the airflow DB or UI?
> > Effectively
> > > this causes tasks to hang even though they are complete, and the DAG
> will
> > > not continue.
> > >
> > > This is happening on 1.8.0. Anyone else seen this or perhaps have a
> > > workaround?
> > >
> > > Thanks!
> > >
> > > --
> > > Marc Weil | Lead Engineer | Growth Automation, Marketing, and
> Engagement
> > |
> > > New Relic
> > > ᐧ
> > >
> >
>

Re: Completed tasks not being marked as completed

Posted by Marc Weil <mw...@newrelic.com>.
From what I can tell, it only affects CeleryExecutor. I've never seen this
behavior with LocalExecutor before.

Max, do you know anything about this type of failure mode?

--
Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
New Relic

On Fri, Jul 28, 2017 at 5:48 AM, Jonas Karlsson <th...@gmail.com> wrote:

> We have the exact same problem. In our case, it's a bash operator starting
> a docker container. The container and process it ran exit, but the 'docker
> run' command is still showing up in the process table, waiting for an
> event.
> I'm trying to switch to LocalExecutor to see if that will help.
>
> _jonas
>
>
> On Thu, Jul 27, 2017 at 4:28 PM Marc Weil <mw...@newrelic.com> wrote:
>
> > Hello,
> >
> > Has anyone seen the behavior when using CeleryExecutor where workers will
> > finish their tasks ("INFO - Task exited with return code 0" shows in the
> > logs) but are never marked as complete in the airflow DB or UI?
> Effectively
> > this causes tasks to hang even though they are complete, and the DAG will
> > not continue.
> >
> > This is happening on 1.8.0. Anyone else seen this or perhaps have a
> > workaround?
> >
> > Thanks!
> >
> > --
> > Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement
> |
> > New Relic
> > ᐧ
> >
>

Re: Completed tasks not being marked as completed

Posted by Jonas Karlsson <th...@gmail.com>.
We have the exact same problem. In our case, it's a bash operator starting
a docker container. The container and the process it ran both exit, but the
'docker run' command is still showing up in the process table, waiting for an event.
I'm trying to switch to LocalExecutor to see if that will help.
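
For context, the operator is essentially of this shape (a sketch with a
placeholder image and command, not our real task; it assumes a dag object
defined elsewhere):

    from airflow.operators.bash_operator import BashOperator

    run_container = BashOperator(
        task_id='run_container',
        # --rm so the container is removed on exit; it's the parent
        # 'docker run' process that lingers in the process table.
        bash_command='docker run --rm my-image:latest python /app/job.py',
        dag=dag,
    )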

_jonas


On Thu, Jul 27, 2017 at 4:28 PM Marc Weil <mw...@newrelic.com> wrote:

> Hello,
>
> Has anyone seen the behavior when using CeleryExecutor where workers will
> finish their tasks ("INFO - Task exited with return code 0" shows in the
> logs) but are never marked as complete in the airflow DB or UI? Effectively
> this causes tasks to hang even though they are complete, and the DAG will
> not continue.
>
> This is happening on 1.8.0. Anyone else seen this or perhaps have a
> workaround?
>
> Thanks!
>
> --
> Marc Weil | Lead Engineer | Growth Automation, Marketing, and Engagement |
> New Relic
> ᐧ
>