Posted to dev@airflow.apache.org by James Meickle <jm...@quantopian.com.INVALID> on 2019/07/31 15:43:20 UTC

Removal of "run_duration" and its impact on orphaned tasks

In my testing of 1.10.4rc3, I discovered that we were getting hit by a
process leak bug (which Ash has since fixed in 1.10.4rc4). This process
leak had minimal impact for most users, but was exacerbated in our case by
using "run_duration" to restart the scheduler every 10 minutes.

To mitigate that issue while remaining on the RC, we removed the use of
"run_duration", since it is deprecated on master anyway:
https://github.com/apache/airflow/blob/master/UPDATING.md#remove-run_duration

Unfortunately, testing on our cluster (1.10.4rc3 plus no "run_duration")
has revealed that while the process leak issue was mitigated, we're now
facing issues with orphaned tasks. These tasks are marked as "scheduled"
by the scheduler, but _not_ successfully queued in Celery even after
multiple scheduler loops. Around 24h after the last restart, we accumulate
enough stuck tasks that the system starts paging and requires a manual
restart.

This specific issue, rather than generic "scheduler instability", was one of
the reasons we'd originally added the scheduler restart. But it appears
that on master, the orphaned task detection code still only runs on
scheduler start, despite the removal of "run_duration":
https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1328

Rather than immediately filing an issue I wanted to inquire a bit more
about why this orphan detection code is only run on scheduler start,
whether it would be safe to send in a PR to run it more often (maybe a
tunable parameter?), and if there's some other configuration issue with
Celery (in our case, backed by AWS Elasticache) that would cause us to see
orphaned tasks frequently.
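
For illustration, roughly what I have in mind (a rough sketch only, not the
actual SchedulerJob code; reset_state_for_orphaned_tasks() is my reading of
what the startup reset linked above calls, and the interval name is made up):

import time

class PeriodicOrphanCheck:
    """Sketch: run the existing orphan reset on a tunable interval inside
    the scheduler loop, instead of only once at scheduler start."""

    def __init__(self, scheduler_job, interval_seconds=300):
        self.job = scheduler_job          # a SchedulerJob-like object
        self.interval_seconds = interval_seconds
        self._last_check = None

    def maybe_reset_orphans(self):
        now = time.monotonic()
        if self._last_check is None or now - self._last_check >= self.interval_seconds:
            # The same reset the scheduler currently does only at startup.
            self.job.reset_state_for_orphaned_tasks()
            self._last_check = now

The scheduler loop would call maybe_reset_orphans() once per iteration, and it
would be a no-op most of the time.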

Re: Removal of "run_duration" and its impact on orphaned tasks

Posted by Bolke de Bruin <bd...@gmail.com>.
James

Could you log this with debug on for SchedulerJob? There is only one place
where the number of tasks can go lower, and that's when the executor reports
having the task. That's logged at debug level. Are you collecting metrics
from Airflow? If so, I can create a small patch that will track tasks up for
execution, tasks at executor, and tasks executed.
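
A minimal way to get that, assuming the stock logging config in 1.10, is to
raise the logging level in airflow.cfg and restart the scheduler (the exact
section/key may differ in a customized logging config):

[core]
logging_level = DEBUG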

cheers
Bolke

Op wo 31 jul. 2019 om 23:08 schreef James Meickle
<jm...@quantopian.com.invalid>:

> Yes, we use the Celery executor.
>
> To clarify, the tasks hadn't been running on workers for a long time, or
> even successfully submitted to Celery, so it's not a case where they got
> queued and then lost after some period of time.
>
> This happened shortly after UTC midnight, when we launch most of our tasks
> (note the UTC timestamp of the server having recently rolled over, vs. the
> execution date on the tasks _mostly_ being the previous date, which would
> be just launching at rollover)
>
> [2019-07-31 01:54:15,783] {{scheduler_job.py:889}} INFO - 21 tasks up for
> execution:
> <TaskInstance: canary.XXX 2019-07-31 00:00:00+00:00 [scheduled]>
> <TaskInstance: trading_files.YYY 2019-07-29 00:00:00+00:00 [scheduled]>
> <TaskInstance: pnl_reporting.AAA 2019-07-30 00:00:00+00:00 [scheduled]>
> <TaskInstance: pnl_reporting.BBB 2019-07-30 00:00:00+00:00 [scheduled]>
> <TaskInstance: pnl_reporting.CCC 2019-07-30 00:00:00+00:00 [scheduled]>
> <TaskInstance: pnl_reporting.DDD 2019-07-30 00:00:00+00:00 [scheduled]>
> <TaskInstance: pnl_reporting.EEE 2019-07-30 00:00:00+00:00 [scheduled]>
> <TaskInstance: us_short_volume.FFF 2019-07-30 00:00:00+00:00 [scheduled]>
> <TaskInstance: us_short_volume.GGG 2019-07-30 00:00:00+00:00 [scheduled]>
> <TaskInstance: us_short_volume.HHH 2019-07-30 00:00:00+00:00 [scheduled]>
> <TaskInstance: poc_noop_nightly_portfolio.III 2019-07-30 00:00:00+00:00
> [scheduled]>
> <TaskInstance: poc_noop_nightly_portfolio.JJJ 2019-07-30 00:00:00+00:00
> [scheduled]>
> <TaskInstance: poc_noop_nightly_portfolio.KKK 2019-07-30 00:00:00+00:00
> [scheduled]>
> <TaskInstance: poc_noop_nightly_portfolio.LLL 2019-07-30 00:00:00+00:00
> [scheduled]>
> <TaskInstance: poc_noop_nightly_portfolio.MMM 2019-07-30 00:00:00+00:00
> [scheduled]>
> <TaskInstance: poc_noop_nightly_portfolio.NNN 2019-07-30 00:00:00+00:00
> [scheduled]>
> <TaskInstance: us_short_volume.OOO 2019-07-30 00:00:00+00:00 [scheduled]>
> <TaskInstance: us_short_volume.PPP 2019-07-30 00:00:00+00:00 [scheduled]>
> <TaskInstance: us_short_volume.QQQ 2019-07-30 00:00:00+00:00 [scheduled]>
> <TaskInstance: us_short_volume.RRR 2019-07-30 00:00:00+00:00 [scheduled]>
> <TaskInstance: nightly_portfolio.SSS 2019-07-30 00:00:00+00:00 [scheduled]>
> [2019-07-31 01:54:15,790] {{scheduler_job.py:921}} INFO - Figuring out
> tasks to run in Pool(name=default_pool) with 90 open slots and 14 task
> instances ready to be queued
> (snip)
> [2019-07-31 01:54:15,807] {{scheduler_job.py:921}} INFO - Figuring out
> tasks to run in Pool(name=short_volume) with 7 open slots and 7 task
> instances ready to be queued
> (snip)
> [2019-07-31 01:54:15,818] {{scheduler_job.py:992}} INFO - Setting the
> follow tasks to queued state:
>
> (blank newline and nothing else, then the log starts repeating)
>
> On Wed, Jul 31, 2019 at 2:51 PM Bolke de Bruin <bd...@gmail.com> wrote:
>
> > Is this all with celery? Afaik Lyft runs with celery? Also if I remember
> > correctly the Google guys had a fix for this but that hasn't been
> > upstreamed yet?
> >
> > With celery task do get "lost" after a while with a certain setting (on a
> > phone so don't have it handy, we do set a higher default)
> >
> > Can you check when those tasks got into "scheduled" and what the time
> > difference is with "now"?
> >
> > B.
> >
> > Sent from my iPhone
> >
> > > On 31 Jul 2019, at 20:56, James Meickle <jmeickle@quantopian.com
> .invalid>
> > wrote:
> > >
> > > Ash:
> > >
> > > We definitely don't run thousands of tasks. Looks like it's closer to
> 300
> > > per execution date (and then retries), if I'm using the TI browser
> right.
> > >
> > > In my case, I found 21 tasks in "scheduled" state after 1 day of not
> > > restarting. One of our hourly "canary" DAGs got included in the
> pile-up -
> > > so it didn't run that hour as expected, so I got paged. (But it wasn't
> > just
> > > canary tasks, the other 20 tasks were all real and important workflows
> > that
> > > were not getting scheduled.)
> > >
> > > If we do change the scheduler to have a "cleanup" step within the loop
> > > instead of pre/post loop, I'd suggest we should:
> > > - Make the time between cleanups a configurable parameter
> > > - Log what cleanup steps are being taken and how long they take
> > > - Add new statsd metrics around cleanups (like "number of orphans
> > reset"),
> > > to help us understand when and why this happens
> > >
> > >
> > >
> > >> On Wed, Jul 31, 2019 at 1:25 PM Tao Feng <fe...@gmail.com> wrote:
> > >>
> > >> Late in the game as I was't aware of `run_duration` option been
> removed.
> > >> But just want to point out that Lyft also did very similar with James'
> > >> setup that we run the scheduler in a fix internal instead of infinite
> > loop
> > >> and let the runit/supervisor to restart the scheduler process. This is
> > to
> > >> solve: 1. orphaned tasks not getting clean up successfully when it
> runs
> > on
> > >> infinite loop; 2. Make sure stale / deleted DAG will get clean up(
> > >>
> > >>
> >
> https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1438
> > >> ?
> > >> <
> >
> https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1438
> > ?>
> > >> )
> > >> properly.
> > >>
> > >> I think if it goes with removing this option and let scheduler runs in
> > an
> > >> infinite loop, we need to change the schedule loop to handle the clean
> > up
> > >> process if it hasn't been done.
> > >>
> > >>> On Wed, Jul 31, 2019 at 10:10 AM Ash Berlin-Taylor <as...@apache.org>
> > wrote:
> > >>>
> > >>> Thanks for testing this out James, shame to discover we still have
> > >>> problems in that area. Do you have an idea of how many tasks per day
> we
> > >> are
> > >>> talking about here?
> > >>>
> > >>> Your cluster schedules quite a large number of tasks over the day (in
> > the
> > >>> 1k-10k range?) right?
> > >>>
> > >>> I'd say whatever causes a task to become orphaned _while_ the
> scheduler
> > >> is
> > >>> still running is the actual bug, and running the orphan detection
> more
> > >>> often may just be replacing one patch (the run duration) with another
> > one
> > >>> (running the orphan detection more than at start up).
> > >>>
> > >>> -ash
> > >>>
> > >>>> On 31 Jul 2019, at 16:43, James Meickle <jmeickle@quantopian.com
> > >> .INVALID>
> > >>> wrote:
> > >>>>
> > >>>> In my testing of 1.10.4rc3, I discovered that we were getting hit
> by a
> > >>>> process leak bug (which Ash has since fixed in 1.10.4rc4). This
> > process
> > >>>> leak was minimal impact for most users, but was exacerbated in our
> > case
> > >>> by
> > >>>> using "run_duration" to restart the scheduler every 10 minutes.
> > >>>>
> > >>>> To mitigate that issue while remaining on the RC, we removed the use
> > of
> > >>>> "run_duration", since it is deprecated as of master anyways:
> > >>>>
> > >>>
> > >>
> >
> https://github.com/apache/airflow/blob/master/UPDATING.md#remove-run_duration
> > >>>>
> > >>>> Unfortunately, testing on our cluster (1.10.4rc3 plus no
> > >> "run_duration")
> > >>>> has revealed that while the process leak issue was mitigated, that
> > >> we're
> > >>>> now facing issues with orphaned tasks. These tasks are marked as
> > >>>> "scheduled" by the scheduler, but _not_ successfully queued in
> Celery
> > >>> even
> > >>>> after multiple scheduler loops. Around ~24h after last restart, we
> > >> start
> > >>>> having enough stuck tasks that the system starts paging and
> requires a
> > >>>> manual restart.
> > >>>>
> > >>>> Rather than generic "scheduler instability", this specific issue was
> > >> one
> > >>> of
> > >>>> the reasons why we'd originally added the scheduler restart. But it
> > >>> appears
> > >>>> that on master, the orphaned task detection code still only runs on
> > >>>> scheduler start despite removing "run_duration":
> > >>>>
> > >>>
> > >>
> >
> https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1328
> > >>>>
> > >>>> Rather than immediately filing an issue I wanted to inquire a bit
> more
> > >>>> about why this orphan detection code is only run on scheduler start,
> > >>>> whether it would be safe to send in a PR to run it more often
> (maybe a
> > >>>> tunable parameter?), and if there's some other configuration issue
> > with
> > >>>> Celery (in our case, backed by AWS Elasticache) that would cause us
> to
> > >>> see
> > >>>> orphaned tasks frequently.
> > >>>
> > >>>
> > >>
> >
>


--
Bolke de Bruin
bdbruin@gmail.com

Re: Removal of "run_duration" and its impact on orphaned tasks

Posted by James Meickle <jm...@quantopian.com.INVALID>.
Yes, we use the Celery executor.

To clarify, these tasks were never even successfully submitted to Celery,
let alone run on workers for any length of time, so it's not a case where
they got queued and then lost after some period of time.

This happened shortly after UTC midnight, which is when we launch most of
our tasks (note that the server's UTC timestamp has recently rolled over,
while the execution date on the tasks is _mostly_ still the previous date,
i.e. runs that would be just launching at rollover):

[2019-07-31 01:54:15,783] {{scheduler_job.py:889}} INFO - 21 tasks up for
execution:
<TaskInstance: canary.XXX 2019-07-31 00:00:00+00:00 [scheduled]>
<TaskInstance: trading_files.YYY 2019-07-29 00:00:00+00:00 [scheduled]>
<TaskInstance: pnl_reporting.AAA 2019-07-30 00:00:00+00:00 [scheduled]>
<TaskInstance: pnl_reporting.BBB 2019-07-30 00:00:00+00:00 [scheduled]>
<TaskInstance: pnl_reporting.CCC 2019-07-30 00:00:00+00:00 [scheduled]>
<TaskInstance: pnl_reporting.DDD 2019-07-30 00:00:00+00:00 [scheduled]>
<TaskInstance: pnl_reporting.EEE 2019-07-30 00:00:00+00:00 [scheduled]>
<TaskInstance: us_short_volume.FFF 2019-07-30 00:00:00+00:00 [scheduled]>
<TaskInstance: us_short_volume.GGG 2019-07-30 00:00:00+00:00 [scheduled]>
<TaskInstance: us_short_volume.HHH 2019-07-30 00:00:00+00:00 [scheduled]>
<TaskInstance: poc_noop_nightly_portfolio.III 2019-07-30 00:00:00+00:00
[scheduled]>
<TaskInstance: poc_noop_nightly_portfolio.JJJ 2019-07-30 00:00:00+00:00
[scheduled]>
<TaskInstance: poc_noop_nightly_portfolio.KKK 2019-07-30 00:00:00+00:00
[scheduled]>
<TaskInstance: poc_noop_nightly_portfolio.LLL 2019-07-30 00:00:00+00:00
[scheduled]>
<TaskInstance: poc_noop_nightly_portfolio.MMM 2019-07-30 00:00:00+00:00
[scheduled]>
<TaskInstance: poc_noop_nightly_portfolio.NNN 2019-07-30 00:00:00+00:00
[scheduled]>
<TaskInstance: us_short_volume.OOO 2019-07-30 00:00:00+00:00 [scheduled]>
<TaskInstance: us_short_volume.PPP 2019-07-30 00:00:00+00:00 [scheduled]>
<TaskInstance: us_short_volume.QQQ 2019-07-30 00:00:00+00:00 [scheduled]>
<TaskInstance: us_short_volume.RRR 2019-07-30 00:00:00+00:00 [scheduled]>
<TaskInstance: nightly_portfolio.SSS 2019-07-30 00:00:00+00:00 [scheduled]>
[2019-07-31 01:54:15,790] {{scheduler_job.py:921}} INFO - Figuring out
tasks to run in Pool(name=default_pool) with 90 open slots and 14 task
instances ready to be queued
(snip)
[2019-07-31 01:54:15,807] {{scheduler_job.py:921}} INFO - Figuring out
tasks to run in Pool(name=short_volume) with 7 open slots and 7 task
instances ready to be queued
(snip)
[2019-07-31 01:54:15,818] {{scheduler_job.py:992}} INFO - Setting the
follow tasks to queued state:

(blank newline and nothing else, then the log starts repeating)

On Wed, Jul 31, 2019 at 2:51 PM Bolke de Bruin <bd...@gmail.com> wrote:

> Is this all with celery? Afaik Lyft runs with celery? Also if I remember
> correctly the Google guys had a fix for this but that hasn't been
> upstreamed yet?
>
> With celery task do get "lost" after a while with a certain setting (on a
> phone so don't have it handy, we do set a higher default)
>
> Can you check when those tasks got into "scheduled" and what the time
> difference is with "now"?
>
> B.
>
> Sent from my iPhone
>
> > On 31 Jul 2019, at 20:56, James Meickle <jm...@quantopian.com.invalid>
> wrote:
> >
> > Ash:
> >
> > We definitely don't run thousands of tasks. Looks like it's closer to 300
> > per execution date (and then retries), if I'm using the TI browser right.
> >
> > In my case, I found 21 tasks in "scheduled" state after 1 day of not
> > restarting. One of our hourly "canary" DAGs got included in the pile-up -
> > so it didn't run that hour as expected, so I got paged. (But it wasn't
> just
> > canary tasks, the other 20 tasks were all real and important workflows
> that
> > were not getting scheduled.)
> >
> > If we do change the scheduler to have a "cleanup" step within the loop
> > instead of pre/post loop, I'd suggest we should:
> > - Make the time between cleanups a configurable parameter
> > - Log what cleanup steps are being taken and how long they take
> > - Add new statsd metrics around cleanups (like "number of orphans
> reset"),
> > to help us understand when and why this happens
> >
> >
> >
> >> On Wed, Jul 31, 2019 at 1:25 PM Tao Feng <fe...@gmail.com> wrote:
> >>
> >> Late in the game as I was't aware of `run_duration` option been removed.
> >> But just want to point out that Lyft also did very similar with James'
> >> setup that we run the scheduler in a fix internal instead of infinite
> loop
> >> and let the runit/supervisor to restart the scheduler process. This is
> to
> >> solve: 1. orphaned tasks not getting clean up successfully when it runs
> on
> >> infinite loop; 2. Make sure stale / deleted DAG will get clean up(
> >>
> >>
> https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1438
> >> ?
> >> <
> https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1438
> ?>
> >> )
> >> properly.
> >>
> >> I think if it goes with removing this option and let scheduler runs in
> an
> >> infinite loop, we need to change the schedule loop to handle the clean
> up
> >> process if it hasn't been done.
> >>
> >>> On Wed, Jul 31, 2019 at 10:10 AM Ash Berlin-Taylor <as...@apache.org>
> wrote:
> >>>
> >>> Thanks for testing this out James, shame to discover we still have
> >>> problems in that area. Do you have an idea of how many tasks per day we
> >> are
> >>> talking about here?
> >>>
> >>> Your cluster schedules quite a large number of tasks over the day (in
> the
> >>> 1k-10k range?) right?
> >>>
> >>> I'd say whatever causes a task to become orphaned _while_ the scheduler
> >> is
> >>> still running is the actual bug, and running the orphan detection more
> >>> often may just be replacing one patch (the run duration) with another
> one
> >>> (running the orphan detection more than at start up).
> >>>
> >>> -ash
> >>>
> >>>> On 31 Jul 2019, at 16:43, James Meickle <jmeickle@quantopian.com
> >> .INVALID>
> >>> wrote:
> >>>>
> >>>> In my testing of 1.10.4rc3, I discovered that we were getting hit by a
> >>>> process leak bug (which Ash has since fixed in 1.10.4rc4). This
> process
> >>>> leak was minimal impact for most users, but was exacerbated in our
> case
> >>> by
> >>>> using "run_duration" to restart the scheduler every 10 minutes.
> >>>>
> >>>> To mitigate that issue while remaining on the RC, we removed the use
> of
> >>>> "run_duration", since it is deprecated as of master anyways:
> >>>>
> >>>
> >>
> https://github.com/apache/airflow/blob/master/UPDATING.md#remove-run_duration
> >>>>
> >>>> Unfortunately, testing on our cluster (1.10.4rc3 plus no
> >> "run_duration")
> >>>> has revealed that while the process leak issue was mitigated, that
> >> we're
> >>>> now facing issues with orphaned tasks. These tasks are marked as
> >>>> "scheduled" by the scheduler, but _not_ successfully queued in Celery
> >>> even
> >>>> after multiple scheduler loops. Around ~24h after last restart, we
> >> start
> >>>> having enough stuck tasks that the system starts paging and requires a
> >>>> manual restart.
> >>>>
> >>>> Rather than generic "scheduler instability", this specific issue was
> >> one
> >>> of
> >>>> the reasons why we'd originally added the scheduler restart. But it
> >>> appears
> >>>> that on master, the orphaned task detection code still only runs on
> >>>> scheduler start despite removing "run_duration":
> >>>>
> >>>
> >>
> https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1328
> >>>>
> >>>> Rather than immediately filing an issue I wanted to inquire a bit more
> >>>> about why this orphan detection code is only run on scheduler start,
> >>>> whether it would be safe to send in a PR to run it more often (maybe a
> >>>> tunable parameter?), and if there's some other configuration issue
> with
> >>>> Celery (in our case, backed by AWS Elasticache) that would cause us to
> >>> see
> >>>> orphaned tasks frequently.
> >>>
> >>>
> >>
>

Re: Removal of "run_duration" and its impact on orphaned tasks

Posted by Bolke de Bruin <bd...@gmail.com>.
Is this all with Celery? AFAIK Lyft runs with Celery too. Also, if I remember correctly the Google guys had a fix for this, but that hasn't been upstreamed yet?

With Celery, tasks do get "lost" after a while with a certain setting (I'm on a phone so I don't have it handy; we do set a higher default).
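
My guess, since I don't have the config handy: the Celery broker
visibility_timeout. For a Redis-style broker like Elasticache it would be
raised with something along these lines in airflow.cfg (treat the section
name and value as assumptions to verify):

[celery_broker_transport_options]
visibility_timeout = 21600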

Can you check when those tasks got into "scheduled" and what the time difference is with "now"?

B.

Sent from my iPhone

> On 31 Jul 2019, at 20:56, James Meickle <jm...@quantopian.com.invalid> wrote:
> 
> Ash:
> 
> We definitely don't run thousands of tasks. Looks like it's closer to 300
> per execution date (and then retries), if I'm using the TI browser right.
> 
> In my case, I found 21 tasks in "scheduled" state after 1 day of not
> restarting. One of our hourly "canary" DAGs got included in the pile-up -
> so it didn't run that hour as expected, so I got paged. (But it wasn't just
> canary tasks, the other 20 tasks were all real and important workflows that
> were not getting scheduled.)
> 
> If we do change the scheduler to have a "cleanup" step within the loop
> instead of pre/post loop, I'd suggest we should:
> - Make the time between cleanups a configurable parameter
> - Log what cleanup steps are being taken and how long they take
> - Add new statsd metrics around cleanups (like "number of orphans reset"),
> to help us understand when and why this happens
> 
> 
> 
>> On Wed, Jul 31, 2019 at 1:25 PM Tao Feng <fe...@gmail.com> wrote:
>> 
>> Late in the game as I was't aware of `run_duration` option been removed.
>> But just want to point out that Lyft also did very similar with James'
>> setup that we run the scheduler in a fix internal instead of infinite loop
>> and let the runit/supervisor to restart the scheduler process. This is to
>> solve: 1. orphaned tasks not getting clean up successfully when it runs on
>> infinite loop; 2. Make sure stale / deleted DAG will get clean up(
>> 
>> https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1438
>> ?
>> <https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1438?>
>> )
>> properly.
>> 
>> I think if it goes with removing this option and let scheduler runs in an
>> infinite loop, we need to change the schedule loop to handle the clean up
>> process if it hasn't been done.
>> 
>>> On Wed, Jul 31, 2019 at 10:10 AM Ash Berlin-Taylor <as...@apache.org> wrote:
>>> 
>>> Thanks for testing this out James, shame to discover we still have
>>> problems in that area. Do you have an idea of how many tasks per day we
>> are
>>> talking about here?
>>> 
>>> Your cluster schedules quite a large number of tasks over the day (in the
>>> 1k-10k range?) right?
>>> 
>>> I'd say whatever causes a task to become orphaned _while_ the scheduler
>> is
>>> still running is the actual bug, and running the orphan detection more
>>> often may just be replacing one patch (the run duration) with another one
>>> (running the orphan detection more than at start up).
>>> 
>>> -ash
>>> 
>>>> On 31 Jul 2019, at 16:43, James Meickle <jmeickle@quantopian.com
>> .INVALID>
>>> wrote:
>>>> 
>>>> In my testing of 1.10.4rc3, I discovered that we were getting hit by a
>>>> process leak bug (which Ash has since fixed in 1.10.4rc4). This process
>>>> leak was minimal impact for most users, but was exacerbated in our case
>>> by
>>>> using "run_duration" to restart the scheduler every 10 minutes.
>>>> 
>>>> To mitigate that issue while remaining on the RC, we removed the use of
>>>> "run_duration", since it is deprecated as of master anyways:
>>>> 
>>> 
>> https://github.com/apache/airflow/blob/master/UPDATING.md#remove-run_duration
>>>> 
>>>> Unfortunately, testing on our cluster (1.10.4rc3 plus no
>> "run_duration")
>>>> has revealed that while the process leak issue was mitigated, that
>> we're
>>>> now facing issues with orphaned tasks. These tasks are marked as
>>>> "scheduled" by the scheduler, but _not_ successfully queued in Celery
>>> even
>>>> after multiple scheduler loops. Around ~24h after last restart, we
>> start
>>>> having enough stuck tasks that the system starts paging and requires a
>>>> manual restart.
>>>> 
>>>> Rather than generic "scheduler instability", this specific issue was
>> one
>>> of
>>>> the reasons why we'd originally added the scheduler restart. But it
>>> appears
>>>> that on master, the orphaned task detection code still only runs on
>>>> scheduler start despite removing "run_duration":
>>>> 
>>> 
>> https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1328
>>>> 
>>>> Rather than immediately filing an issue I wanted to inquire a bit more
>>>> about why this orphan detection code is only run on scheduler start,
>>>> whether it would be safe to send in a PR to run it more often (maybe a
>>>> tunable parameter?), and if there's some other configuration issue with
>>>> Celery (in our case, backed by AWS Elasticache) that would cause us to
>>> see
>>>> orphaned tasks frequently.
>>> 
>>> 
>> 

Re: Removal of "run_duration" and its impact on orphaned tasks

Posted by James Meickle <jm...@quantopian.com.INVALID>.
Ash:

We definitely don't run thousands of tasks. Looks like it's closer to 300
per execution date (and then retries), if I'm using the TI browser right.

In my case, I found 21 tasks in "scheduled" state after 1 day of not
restarting. One of our hourly "canary" DAGs got caught in the pile-up, so it
didn't run that hour as expected and I got paged. (But it wasn't just canary
tasks; the other 20 were real and important workflows that were not getting
scheduled.)

If we do change the scheduler to have a "cleanup" step within the loop
instead of pre/post loop, I'd suggest we should:
- Make the time between cleanups a configurable parameter
- Log what cleanup steps are being taken and how long they take
- Add new statsd metrics around cleanups (like "number of orphans reset"),
to help us understand when and why this happens
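
A rough sketch of what that could look like (illustrative only: the metric
names are made up, the assumption that the orphan reset returns a count is
mine, and I'm using the plain statsd client as a stand-in for however Airflow
actually emits metrics):

import time
from statsd import StatsClient  # stand-in for Airflow's statsd integration

statsd = StatsClient(host="localhost", port=8125, prefix="airflow")

def run_cleanup_step(job, log):
    # Hypothetical cleanup step run on a configurable interval inside the
    # scheduler loop; assumes the orphan reset returns how many TIs it reset.
    started = time.monotonic()
    num_reset = job.reset_state_for_orphaned_tasks()
    elapsed = time.monotonic() - started

    log.info("Cleanup: reset %s orphaned task instances in %.2fs",
             num_reset, elapsed)
    statsd.incr("scheduler.orphans_reset", num_reset)
    statsd.timing("scheduler.cleanup_duration", int(elapsed * 1000))  # ms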



On Wed, Jul 31, 2019 at 1:25 PM Tao Feng <fe...@gmail.com> wrote:

> Late in the game as I was't aware of `run_duration` option been removed.
> But just want to point out that Lyft also did very similar with James'
> setup that we run the scheduler in a fix internal instead of infinite loop
> and let the runit/supervisor to restart the scheduler process. This is to
> solve: 1. orphaned tasks not getting clean up successfully when it runs on
> infinite loop; 2. Make sure stale / deleted DAG will get clean up(
>
> https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1438
> ?
> <https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1438?>
> )
> properly.
>
> I think if it goes with removing this option and let scheduler runs in an
> infinite loop, we need to change the schedule loop to handle the clean up
> process if it hasn't been done.
>
> On Wed, Jul 31, 2019 at 10:10 AM Ash Berlin-Taylor <as...@apache.org> wrote:
>
> > Thanks for testing this out James, shame to discover we still have
> > problems in that area. Do you have an idea of how many tasks per day we
> are
> > talking about here?
> >
> > Your cluster schedules quite a large number of tasks over the day (in the
> > 1k-10k range?) right?
> >
> > I'd say whatever causes a task to become orphaned _while_ the scheduler
> is
> > still running is the actual bug, and running the orphan detection more
> > often may just be replacing one patch (the run duration) with another one
> > (running the orphan detection more than at start up).
> >
> > -ash
> >
> > > On 31 Jul 2019, at 16:43, James Meickle <jmeickle@quantopian.com
> .INVALID>
> > wrote:
> > >
> > > In my testing of 1.10.4rc3, I discovered that we were getting hit by a
> > > process leak bug (which Ash has since fixed in 1.10.4rc4). This process
> > > leak was minimal impact for most users, but was exacerbated in our case
> > by
> > > using "run_duration" to restart the scheduler every 10 minutes.
> > >
> > > To mitigate that issue while remaining on the RC, we removed the use of
> > > "run_duration", since it is deprecated as of master anyways:
> > >
> >
> https://github.com/apache/airflow/blob/master/UPDATING.md#remove-run_duration
> > >
> > > Unfortunately, testing on our cluster (1.10.4rc3 plus no
> "run_duration")
> > > has revealed that while the process leak issue was mitigated, that
> we're
> > > now facing issues with orphaned tasks. These tasks are marked as
> > > "scheduled" by the scheduler, but _not_ successfully queued in Celery
> > even
> > > after multiple scheduler loops. Around ~24h after last restart, we
> start
> > > having enough stuck tasks that the system starts paging and requires a
> > > manual restart.
> > >
> > > Rather than generic "scheduler instability", this specific issue was
> one
> > of
> > > the reasons why we'd originally added the scheduler restart. But it
> > appears
> > > that on master, the orphaned task detection code still only runs on
> > > scheduler start despite removing "run_duration":
> > >
> >
> https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1328
> > >
> > > Rather than immediately filing an issue I wanted to inquire a bit more
> > > about why this orphan detection code is only run on scheduler start,
> > > whether it would be safe to send in a PR to run it more often (maybe a
> > > tunable parameter?), and if there's some other configuration issue with
> > > Celery (in our case, backed by AWS Elasticache) that would cause us to
> > see
> > > orphaned tasks frequently.
> >
> >
>

Re: Removal of "run_duration" and its impact on orphaned tasks

Posted by Tao Feng <fe...@gmail.com>.
Late to the game, as I wasn't aware the `run_duration` option had been removed.
But I just want to point out that Lyft does something very similar to James'
setup: we run the scheduler for a fixed interval instead of an infinite loop
and let runit/supervisor restart the scheduler process. This is to
solve: 1. orphaned tasks not getting cleaned up successfully when the scheduler
runs in an infinite loop; 2. making sure stale/deleted DAGs get cleaned up (
https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1438
) properly.

I think if we go with removing this option and letting the scheduler run in an
infinite loop, we need to change the scheduler loop to handle the cleanup
process, if that hasn't been done already.
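
One way to keep this restart behavior without run_duration is the scheduler's
--num_runs flag under a process supervisor; the supervisord stanza below is
just an illustration, and the numbers are placeholders:

[program:airflow-scheduler]
; exit after N scheduler loops, then supervisord restarts the process,
; which re-runs the startup cleanup
command=airflow scheduler --num_runs 100
autorestart=true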

On Wed, Jul 31, 2019 at 10:10 AM Ash Berlin-Taylor <as...@apache.org> wrote:

> Thanks for testing this out James, shame to discover we still have
> problems in that area. Do you have an idea of how many tasks per day we are
> talking about here?
>
> Your cluster schedules quite a large number of tasks over the day (in the
> 1k-10k range?) right?
>
> I'd say whatever causes a task to become orphaned _while_ the scheduler is
> still running is the actual bug, and running the orphan detection more
> often may just be replacing one patch (the run duration) with another one
> (running the orphan detection more than at start up).
>
> -ash
>
> > On 31 Jul 2019, at 16:43, James Meickle <jm...@quantopian.com.INVALID>
> wrote:
> >
> > In my testing of 1.10.4rc3, I discovered that we were getting hit by a
> > process leak bug (which Ash has since fixed in 1.10.4rc4). This process
> > leak was minimal impact for most users, but was exacerbated in our case
> by
> > using "run_duration" to restart the scheduler every 10 minutes.
> >
> > To mitigate that issue while remaining on the RC, we removed the use of
> > "run_duration", since it is deprecated as of master anyways:
> >
> https://github.com/apache/airflow/blob/master/UPDATING.md#remove-run_duration
> >
> > Unfortunately, testing on our cluster (1.10.4rc3 plus no "run_duration")
> > has revealed that while the process leak issue was mitigated, that we're
> > now facing issues with orphaned tasks. These tasks are marked as
> > "scheduled" by the scheduler, but _not_ successfully queued in Celery
> even
> > after multiple scheduler loops. Around ~24h after last restart, we start
> > having enough stuck tasks that the system starts paging and requires a
> > manual restart.
> >
> > Rather than generic "scheduler instability", this specific issue was one
> of
> > the reasons why we'd originally added the scheduler restart. But it
> appears
> > that on master, the orphaned task detection code still only runs on
> > scheduler start despite removing "run_duration":
> >
> https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1328
> >
> > Rather than immediately filing an issue I wanted to inquire a bit more
> > about why this orphan detection code is only run on scheduler start,
> > whether it would be safe to send in a PR to run it more often (maybe a
> > tunable parameter?), and if there's some other configuration issue with
> > Celery (in our case, backed by AWS Elasticache) that would cause us to
> see
> > orphaned tasks frequently.
>
>

Re: Removal of "run_duration" and its impact on orphaned tasks

Posted by Ash Berlin-Taylor <as...@apache.org>.
Thanks for testing this out, James; it's a shame to discover we still have problems in that area. Do you have an idea of how many tasks per day we are talking about here?

Your cluster schedules quite a large number of tasks over the day (in the 1k-10k range?), right?

I'd say whatever causes a task to become orphaned _while_ the scheduler is still running is the actual bug, and running the orphan detection more often may just be replacing one patch (the run duration) with another (running the orphan detection more often than just at startup).

-ash

> On 31 Jul 2019, at 16:43, James Meickle <jm...@quantopian.com.INVALID> wrote:
> 
> In my testing of 1.10.4rc3, I discovered that we were getting hit by a
> process leak bug (which Ash has since fixed in 1.10.4rc4). This process
> leak was minimal impact for most users, but was exacerbated in our case by
> using "run_duration" to restart the scheduler every 10 minutes.
> 
> To mitigate that issue while remaining on the RC, we removed the use of
> "run_duration", since it is deprecated as of master anyways:
> https://github.com/apache/airflow/blob/master/UPDATING.md#remove-run_duration
> 
> Unfortunately, testing on our cluster (1.10.4rc3 plus no "run_duration")
> has revealed that while the process leak issue was mitigated, that we're
> now facing issues with orphaned tasks. These tasks are marked as
> "scheduled" by the scheduler, but _not_ successfully queued in Celery even
> after multiple scheduler loops. Around ~24h after last restart, we start
> having enough stuck tasks that the system starts paging and requires a
> manual restart.
> 
> Rather than generic "scheduler instability", this specific issue was one of
> the reasons why we'd originally added the scheduler restart. But it appears
> that on master, the orphaned task detection code still only runs on
> scheduler start despite removing "run_duration":
> https://github.com/apache/airflow/blob/master/airflow/jobs/scheduler_job.py#L1328
> 
> Rather than immediately filing an issue I wanted to inquire a bit more
> about why this orphan detection code is only run on scheduler start,
> whether it would be safe to send in a PR to run it more often (maybe a
> tunable parameter?), and if there's some other configuration issue with
> Celery (in our case, backed by AWS Elasticache) that would cause us to see
> orphaned tasks frequently.