Posted to dev@airflow.apache.org by harish singh <ha...@gmail.com> on 2016/12/03 23:52:27 UTC

Performance: backfill --mark_success

Hi all,

We have been running Airflow in production for over 8-9 months now.
I know there is a separate thread in place for Airflow 2.0.
But I was not sure if any prior version has this fixed. If not, I
will add this to the other email thread for 2.0.

When I run airflow backfill with "-m" (mark jobs as succeeded without
running them), is there a way to optimize this call?

For example:
airflow backfill TEST_DAG -s 2016-11-01T00:00:00 -e 2016-12-01T00:00:00 -m

Here, I am running a backfill for a month (from 1st Nov to 1st Dec),
essentially marking the jobs as succeeded without running them.

It has been more than an hour and the backfill has managed to reach only
up to 2nd Nov.
This seems very slow when there is no need to even run the tasks.


I am running Airflow 1.7.0:
These are my related configuration settings:

parallelism = 50
dag_concurrency = 20
max_active_runs_per_dag = 8

Also, I have around 9 DAGs running (all hourly). The other 8 DAGs are
running on schedule with a start_date of 2016-11-01T00:00:00.

My question is: since I am only marking the jobs as "succeeded"
without running them, can this be done in one SQL query, instead of
per hour, per task? Maybe find all the TaskInstances that need to be
marked succeeded and then just run a single SQL statement?
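
Something like this, just to illustrate the idea (I have not checked the
actual schema, and this ignores task instances that don't have a row in
the db yet):

    UPDATE task_instance
    SET state = 'success'
    WHERE dag_id = 'TEST_DAG'
      AND execution_date >= '2016-11-01 00:00:00'
      AND execution_date < '2016-12-01 00:00:00';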

I may not be aware of a lot of things here, and it is very possible I
am assuming a lot of things incorrectly.
Please feel free to correct me.


Thanks,
Harish

Re: Performance: backfill --mark_success

Posted by Maxime Beauchemin <ma...@gmail.com>.
`job_heartbeat_sec` is a configuration parameter (airflow.cfg) that sets
how long jobs wait between "cycles". Lowering it in your dev environment
will make backfills go faster.
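
For what it's worth, that looks like this in airflow.cfg (section name
from memory, so double-check against your version's default config):

    [scheduler]
    job_heartbeat_sec = 5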

Max


Re: Performance: backfill --mark_success

Posted by harish singh <ha...@gmail.com>.
I am working around this with a not-so-pretty, hacky solution.
Instead of "backfill -m" for the whole DAG, I am using the "-t" flag and
marking success on only the first task of my pipeline. Once the backfill
is complete, I use the UI to "Mark Success" on all "Future" and
"Downstream" tasks.
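
For reference, the command looks something like this ("first_task" stands
in for the actual id of my root task; "-t" takes a regex):

    airflow backfill TEST_DAG -t '^first_task$' -s 2016-11-01T00:00:00 -e 2016-12-01T00:00:00 -m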

Max,
I am not sure I clearly understood the point about an individual
"heartrate" per job. Can `job_heartbeat_sec` be specified on a per-task
basis?

Thanks,
Harish


Re: Performance: backfill --mark_success

Posted by Maxime Beauchemin <ma...@gmail.com>.
Oh, thanks for pointing this out; I just did a round of review on that PR.

While we have people's attention around backfill on this thread, I'd love
to introduce the new term "scheduler catchup" as something distinct from
`backfill`, at least until we get a single code path for both operations.

Max


RE: Performance: backfill --mark_success

Posted by Bolke de Bruin <bd...@gmail.com>.
There is a PR out for specifying at the DAG level that there should be no backfills at all, as well.



Re: Performance: backfill --mark_success

Posted by Maxime Beauchemin <ma...@gmail.com>.
The backfill `mark_success` logic could really be optimized by not relying
on `airflow run --mark_success`: instead of actually triggering tasks and
going through the backfill logic at all, alter the database state directly.
Simply determine the set of task instances in scope and merge (upsert) a
`success` state into the db directly.
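
Roughly like this, as an untested sketch against the 1.7-era models (the
function is hypothetical, and a real version would need to be careful not
to clobber other columns on existing rows):

    from airflow import settings
    from airflow.models import DagBag, TaskInstance

    def mark_range_success(dag_id, start_date, end_date):
        # Upsert a 'success' state for every task instance in the window.
        dag = DagBag().get_dag(dag_id)
        session = settings.Session()
        execution_date = start_date
        while execution_date < end_date:
            for task in dag.tasks:
                ti = TaskInstance(task, execution_date)
                ti.state = 'success'
                # merge() inserts the row if it's missing, updates it otherwise
                session.merge(ti)
            # assumes schedule_interval is a timedelta (e.g. hourly DAGs)
            execution_date += dag.schedule_interval
        session.commit()
        session.close()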

To accelerate it as it is today, though, you can reduce some of the
heartbeat configurations (job_heartbeat_sec). It's usually desirable to
have this setting lower in dev (say, 5 seconds) than in production (30-60
seconds).

I suggest putting better defaults in place for `heartrate`, individually
configurable for the different types of jobs in `jobs.py`.

Max


Re: Performance: backfill --mark_success

Posted by Laura Lorenz <ll...@industrydive.com>.
This is not that helpful a message, but I also had a problem with
`airflow backfill -m` going super slow on Airflow 1.7.0. In the end I got
around the need for it in that specific case, thinking that it was broken
in 1.7.0 per (
https://gitter.im/apache/incubator-airflow?at=56c3956c1f9833ef7c9ba8ee),
but now that I am writing this and triangulating 1.7.0's release date
against that gitter comment, that explanation doesn't make sense. I'll
give it another go.

>

Re: Performance: backfill --mark_success

Posted by Vikas Malhotra <vi...@gmail.com>.
Hello Harish,

Based on my understanding of Python multiprocessing, a task instance gets
a record in the underlying database only after there is an explicit call
to Airflow from that library (using the LocalExecutor). So you won't find
a record in the database unless and until that task instance has been
initiated. I might be wrong in these assumptions and would love to be
corrected if that's the case.

We have been using the LatestOnlyOperator and it seems to be working well
for skipping tasks when they are not current (basically avoiding backfill
by marking all tasks downstream of the LatestOnlyOperator as skipped). It's
present in the master branch as of now, and I would recommend looking at
that operator for backfills.
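
A minimal sketch of the placement (the import path is from master as of
now, so please verify it against your checkout):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.latest_only_operator import LatestOnlyOperator

    dag = DAG('latest_only_example', schedule_interval=timedelta(hours=1),
              start_date=datetime(2016, 11, 1))

    latest_only = LatestOnlyOperator(task_id='latest_only', dag=dag)
    task = DummyOperator(task_id='task', dag=dag)
    # 'task' is skipped for any run that is not the most recent one
    task.set_upstream(latest_only)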

Thanks!
Vikas
