Posted to dev@airflow.apache.org by Andrew Phillips <an...@apache.org> on 2016/08/09 15:50:05 UTC

Restarting the scheduler regularly - still current advice?

Hi all

I just wanted to check to what extent the advice in [1] and [2], namely
to restart the scheduler "every once in a while", is still considered
accurate?

"Restart your scheduler process to get a clean environment every once in
a while. Use --num_runs N scheduler CLI option to make it stop after N
runs and have some supervisor ensuring it is always running. See issue
698"

"The scheduler should be restarted frequently

In our experience, a long running scheduler process, at least with the
CeleryExecutor, ends up not scheduling some tasks. We still don’t know
the exact cause, unfortunately.

Fortunately, airflow has a built-in workaround in the form of
the --num_runs flag. It specifies a number of iterations for the
scheduler to run of its loop before it quits. We’re running it with 10
iterations, Airbnb runs it with 5. Note that this will cause problems
when using the LocalExecutor."
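
For readers unfamiliar with the workaround both quotes describe, here is a
minimal sketch of a supervisor loop in Python. It assumes the scheduler is
started as "airflow scheduler --num_runs N" (the flag named in the quotes) and
simply relaunches the command each time it exits. The function name
supervise and its parameters are hypothetical, not part of Airflow; in
practice a process manager such as supervisord or systemd would usually play
this role.

```python
import subprocess
import time


def supervise(command, max_launches=None, pause_seconds=0.0):
    """Run `command` repeatedly, relaunching it each time it exits.

    Returns the list of exit codes observed. With max_launches=None the
    loop never ends, which is what a real supervisor would do.
    """
    exit_codes = []
    while max_launches is None or len(exit_codes) < max_launches:
        # Blocks until the child exits, e.g. after N scheduler runs.
        exit_codes.append(subprocess.call(command))
        # Short pause so a command that fails instantly does not spin.
        time.sleep(pause_seconds)
    return exit_codes
```

A call such as supervise(["airflow", "scheduler", "--num_runs", "10"],
pause_seconds=5.0) would then relaunch the scheduler every time it stops
after 10 runs.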

Both documents are pretty new, so I assume this is still considered
relevant. Could you give some guidance on what kind of frequency is
recommended here, or is that very dependent on the particular
installation?

Also, which of the current JIRA issues (if any) is the new version of
"issue 698" as mentioned in the first quote? There seem to be quite a
few issues relating to the scheduler getting stuck [3] - which one(s)
should we follow and/or add information to, in order to best track progress on
this topic?

Thanks!

[1] https://cwiki.apache.org/confluence/display/AIRFLOW/Common+Pitfalls
[2] https://medium.com/handy-tech/airflow-tips-tricks-and-pitfalls-9ba53fba14eb#.ahcprdr9r
[3] https://issues.apache.org/jira/browse/AIRFLOW-39?jql=project%20%3D%20AIRFLOW%20AND%20text%20~%20%22scheduler%22

Re: Restarting the scheduler regularly - still current advice?

Posted by Bolke de Bruin <bd...@gmail.com>.

Pools are an outstanding issue, which I am working on, but the fix won’t be part of the Apache release. The Apache release will initially just be 1.7.1.3 + Licenses - Highcharts + D3.

- B.


Re: Restarting the scheduler regularly - still current advice?

Posted by Lance Norskog <la...@gmail.com>.

We are on 1.6.2 and would love to upgrade to a modern version. We were
holding out for the first Apache release.

Also, we have cases where the various concurrent task limits are ignored
and we have 50 tasks scheduled at once. A DAG like this:

from airflow import DAG

# `hour` is assumed to be defined earlier in the DAG file.
dag = DAG(
    dag_id='xxx',
    schedule_interval="0 " + str(hour) + " * * *",
    max_active_runs=1,
    concurrency=1,
)

and where all of the tasks use the same Pool. max_active_runs= is ignored,
concurrency= is ignored, the Pool is ignored.
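
When reporting a case like this, it can help to show exactly which limits
were exceeded. The following is a minimal sketch in plain Python, not Airflow
code: given a snapshot of running tasks (however it is collected), it counts
tasks per DAG and per pool and reports every count above its limit. All names
are hypothetical, and it assumes each running task occupies one pool slot.

```python
from collections import Counter


def find_limit_violations(running_tasks, dag_concurrency, pool_slots):
    """Return (kind, name, running, limit) tuples for every exceeded limit.

    running_tasks: iterable of (dag_id, pool) pairs, one per running task.
    dag_concurrency: dict mapping dag_id to its allowed concurrent tasks.
    pool_slots: dict mapping pool name to its number of slots.
    """
    running_tasks = list(running_tasks)
    per_dag = Counter(dag_id for dag_id, _ in running_tasks)
    per_pool = Counter(pool for _, pool in running_tasks if pool is not None)

    violations = []
    for dag_id, count in sorted(per_dag.items()):
        limit = dag_concurrency.get(dag_id)
        if limit is not None and count > limit:
            violations.append(("dag", dag_id, count, limit))
    for pool, count in sorted(per_pool.items()):
        limit = pool_slots.get(pool)
        if limit is not None and count > limit:
            violations.append(("pool", pool, count, limit))
    return violations
```

For the situation described above (50 tasks of one DAG running at once with
concurrency=1), the function would report the DAG, and also the pool if the
pool has fewer than 50 slots.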



-- 
Lance Norskog
lance.norskog@gmail.com
Redwood City, CA

Re: Restarting the scheduler regularly - still current advice?

Posted by Bolke de Bruin <bd...@gmail.com>.

I disagree. Num_runs should NOT be used anymore, and I would really like to hear about ‘stuck’ schedulers on the release or on master, preferably with the CeleryExecutor (the LocalExecutor can sometimes look stuck but isn’t). Restarting should only be required for clearing up database connections, as we are not very good at that yet.

- Bolke


Re: Restarting the scheduler regularly - still current advice?

Posted by Lance Norskog <la...@gmail.com>.

Yes, it is still current advice.

My experience is that after running for (let's say) days, the app develops
memory corruption. I've seen three different ways that memory corruption
shows up. The scheduler failure is just one of these three symptoms.

The other two symptoms are
1) the main page of the UI shows a different list of running DAGs than
what is really configured,
2) a task contains some configuration data that should be in a neighboring
task, and fails.

Frankly, I would configure all 5 daemons to restart periodically, not just
the scheduler daemon.





-- 
Lance Norskog
lance.norskog@gmail.com
Redwood City, CA