Posted to dev@airflow.apache.org by Maxime Beauchemin <ma...@gmail.com> on 2018/05/24 17:26:41 UTC

Is `airflow backfill` dysfunctional?

So I'm running a backfill for what feels like the first time in years using
a simple `airflow backfill --local` command.

First I start getting a ton of `logging.info` output for each task that
cannot be started just yet, at every tick, flooding my terminal with the
keyword `FAILED`, looking like a million lines like this one:

[2018-05-24 14:33:07,852] {models.py:1123} INFO - Dependencies not met for
<TaskInstance: some_dag.some_task_id 2018-01-28 00:00:00 [scheduled]>,
dependency 'Trigger Rule' FAILED: Task's trigger rule 'all_success' re
quires all upstream tasks to have succeeded, but found 1 non-success(es).
upstream_tasks_state={'successes': 0L, 'failed': 0L, 'upstream_failed': 0L,
'skipped': 0L, 'done': 0L}, upstream_task_ids=['some_other_task_id']

Good thing I triggered 1 month and not the 2 years I actually need; the logs
alone would be "big data". Now I'm unclear whether anything is actually
running or whether I did something wrong, so I decide to kill the process so
I can set a smaller date range and get a better picture of what's up.

I check my logging level, am I in DEBUG? Nope. Just INFO. So I take a note
that I'll need to find that log-flooding line and demote it to DEBUG in a
quick PR, no biggy.
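For what it's worth, the fix I had in mind is a one-line demotion of that
call site. A hypothetical sketch (the real line lives in models.py; the
function name and arguments here are stand-ins, not the actual signature):

```python
import logging

log = logging.getLogger(__name__)

# Hypothetical sketch of the PR: the per-task dependency message is demoted
# from INFO to DEBUG so it no longer floods backfill terminal output.
# `task_instance` and `reason` stand in for whatever models.py actually passes.
def report_unmet_dependencies(task_instance, reason):
    # before: log.info("Dependencies not met for %s, %s", task_instance, reason)
    log.debug("Dependencies not met for %s, %s", task_instance, reason)
```

With that change, the flood only shows up when you explicitly opt into DEBUG.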

Now I restart with just a single schedule, and get an error `Dag {some_dag}
has reached maximum amount of 3 dag runs`. Hmmm, I wish backfill could just
pick up where it left off. Maybe I need to run an `airflow clear` command
and restart? Ok, ran my clear command, same error is showing up. Dead end.

Maybe there is some new `airflow clear --reset-dagruns` option? Doesn't
look like it... Maybe `airflow backfill` has some new switches to pick up
where it left off? Can't find it. Am I supposed to clear the DAG Runs
manually in the UI?  This is a pre-production, in-development DAG, so it's
not on the production web server. Am I supposed to fire up my own web
server to go and manually handle the backfill-related DAG Runs? Can I connect
to my staging MySQL and manually clear some DAG runs?

So. Fire up a web server, navigate to my dag_id, delete the DAG runs, it
appears I can finally start over.
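For the record, the "manual surgery" alternative I was fishing for amounts to
two DELETEs against the metadata database. A hedged sketch (demonstrated on
SQLite for portability; the real store here is MySQL, and the table/column
names follow the 1.x metadata schema):

```python
import sqlite3

# Hedged sketch: reset a stuck backfill by deleting its dag_run and
# task_instance rows for the affected window. Table and column names follow
# the Airflow 1.x metadata schema; adapt the parameter placeholders for MySQL.
def reset_backfill_window(conn, dag_id, start_date, end_date):
    for table in ("task_instance", "dag_run"):
        conn.execute(
            f"DELETE FROM {table} WHERE dag_id = ? "
            "AND execution_date BETWEEN ? AND ?",
            (dag_id, start_date, end_date),
        )
    conn.commit()
```

Obviously poking the metadata DB directly is a last resort; the point is that
this should be a CLI switch, not hand-written SQL.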

Next thought was: "Alright looks like I need to go Linus on the mailing
list".

What am I missing? I'm really hoping these issues are specific to 1.8.2!

Backfilling is core to Airflow and should work very well. I want to restate
some reqs for Airflow backfill:
* when failing / interrupted, it should seamlessly be able to pick up where
it left off
* terminal logging at the INFO level should be a clear, human-consumable
indicator of progress
* backfill-related operations (including restarts) should be doable through
CLI interactions, and not require web server interactions as the typical
sandbox (dev environment) shouldn't assume the existence of a web server
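To make the second requirement concrete, a sketch of what INFO-level output
could look like: one aggregated line per tick instead of one line per blocked
task (the function and the `counts` mapping are hypothetical, not existing
Airflow code):

```python
import logging

log = logging.getLogger("airflow.backfill")

# Hypothetical sketch: summarize a backfill tick as a single INFO line.
# `counts` maps a task-instance state to how many instances are in it.
def log_tick_progress(tick, counts):
    log.info(
        "backfill tick %d | success=%d running=%d failed=%d waiting=%d",
        tick,
        counts.get("success", 0),
        counts.get("running", 0),
        counts.get("failed", 0),
        counts.get("waiting", 0),
    )
```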

Let's fix this.

Max

Re: Is `airflow backfill` dysfunctional?

Posted by Trent Robbins <ro...@gmail.com>.
I had a similar experience but don't remember the details - it was
necessary to delete all dag runs and tasks for items you wanted to
backfill. We probably could have dropped those database rows but did not
try. This was primarily for when there were connection issues or input
files missing that were then later available for the dag to process.

Trent


Re: Is `airflow backfill` dysfunctional?

Posted by Grant Nicholas <gr...@u.northwestern.edu>.
+1 on the backfill CLI command being a wrapper around submitting a job to
the REST API.

Since backfills run client-side as a CLI command, if something goes wrong
on that node temporarily then the backfill will get killed and never
restart. When a backfill dies overnight and you have to restart it in the
morning, it is super painful knowing you wasted a bunch of time.
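The wrapper idea can be sketched as a thin client. The endpoint below is
hypothetical (no such route exists in Airflow today); it only illustrates
the shape of the proposal:

```python
import json
import urllib.request

# Hypothetical sketch of "backfill as a server-side job": the CLI merely
# POSTs the request and the server owns the job's lifecycle, so a flaky
# client node can no longer kill the backfill. The route is made up.
def submit_backfill(base_url, dag_id, start_date, end_date):
    payload = json.dumps(
        {"dag_id": dag_id, "start_date": start_date, "end_date": end_date}
    ).encode()
    req = urllib.request.Request(
        base_url + "/api/backfills",  # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # server replies with a job handle
```

The returned handle would let you poll status or resume from any machine.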



On Sun, Apr 14, 2019 at 1:38 PM Driesprong, Fokko <fo...@driesprong.frl>
wrote:

> Good points James,
>
> Personally, I never use the CLI backfilling, and also recommend colleagues
> not to use it because of the points that you mention. I also resort to the
> poor man's backfilling (clearing the future and past in the UI).
>
> I'd rather get rid of the CLI, and would like to see the possibility to
> submit a backfill job through the REST API. In this case, it can be part of
> the web UI, but you could also write a CLI tool if that is your thing :-)
>
> Cheers, Fokko
>
> Op za 13 apr. 2019 om 23:26 schreef Maxime Beauchemin <
> maximebeauchemin@gmail.com>:
>
> > +1, backfilling, and related "subdag surgeries" are core to a data
> > engineer's job, and great tooling around this is super important.
> Backfill
> > needs more TLC!
> >
> > Max
> >
> > On Fri, Apr 12, 2019 at 11:48 PM Chao-Han Tsai <mi...@gmail.com>
> > wrote:
> >
> > > +1 on improving backfill.
> > >
> > > - The terminal interface was uselessly verbose. It was scrolling fast
> > > > enough to be unreadable.
> > >
> > >
> > > I agree that backfill is currently too verbose. It simply logs too many
> > > things and it is hard to read. Oftentimes, I only care about the
> number
> > of
> > > tasks/dagruns that are in-progress/finished/not started. I had a PR
> > > <https://github.com/apache/airflow/pull/3478> that implements a
> progress
> > > bar for backfill but was not able to finish. Probably something that
> can
> > > help improve the backfill experience.
> > >
> > > - The backfill exceeded safe concurrency limits for the cluster and
> > > > could've easily brought it down if I'd left it running.
> > >
> > >
> > > Btw. backfill now respects pool limitations, but we should probably
> > > look into making it respect the concurrency limit.
> > >
> > > Chao-Han
> > >
> > > >
> > > >
> > >
> > >
> > > On Mon, Mar 4, 2019 at 12:35 PM James Meickle
> > > <jm...@quantopian.com.invalid> wrote:
> > >
> > > > This is an old thread, but I wanted to bump it as I just had a really
> > bad
> > > > experience using backfill. I'd been hesitant to even try backfills
> out
> > > > given what I've read about it, so I've just relied on the UI to
> "Clear"
> > > > entire tasks. However, I wanted to give it a shot the "right" way.
> > > Issues I
> > > > ran into:
> > > >
> > > > - The dry run flag didn't give good feedback about which dagruns and
> > task
> > > > instances will be affected (and is very easy to typo as "--dry-run")
> > > >
> > > > - The terminal interface was uselessly verbose. It was scrolling fast
> > > > enough to be unreadable.
> > > >
> > > > - The backfill exceeded safe concurrency limits for the cluster and
> > > > could've easily brought it down if I'd left it running.
> > > >
> > > > - Tasks in the backfill were executed out of order despite the tasks
> > > having
> > > > `depends_on_past`
> > > >
> > > > - The backfill converted all existing DAGRuns to be backfill runs
> that
> > > the
> > > > scheduler later ignored, which is not how I would've expected this to
> > > work
> > > > (nor was it indicated in the dry run)
> > > >
> > > > I ended up having to do manual recovery work in the database to turn
> > the
> > > > "backfill" runs back into scheduler runs, and then switch to using
> > > `airflow
> > > > clear`. I'm a heavy Airflow user and this took me an hour; it
> would've
> > > been
> > > > much worse for anyone else on my team.
> > > >
> > > > I don't have any specific suggestions here other than to confirm that
> > > this
> > > > feature needs an overhaul if it's to be recommended to anyone.
> > > >
> > > > On Fri, Jun 8, 2018 at 5:38 PM Maxime Beauchemin <
> > > > maximebeauchemin@gmail.com>
> > > > wrote:
> > > >
> > > > > Ash I don't see how this could happen unless maybe the node doing
> the
> > > > > backfill is using another metadata database.
> > > > >
> > > > > In general we recommend for people to run --local backfills and
> have
> > > the
> > > > > default/sandbox template for `airflow.cfg` use a LocalExecutor with
> > > > > reasonable parallelism to make that behavior the default.
> > > > >
> > > > > Given the [not-so-great] state of backfill, I'm guessing many have
> > been
> > > > > using the scheduler to do backfills. From that regard it would be
> > nice
> > > to
> > > > > have CLI commands to generate dagruns or alter the state of
> existing
> > > ones
> > > > >
> > > > > Max
> > > > >
> > > > > On Fri, Jun 8, 2018 at 8:56 AM Ash Berlin-Taylor <
> > > > > ash_airflowlist@firemirror.com> wrote:
> > > > >
> > > > > > Somewhat related to this, but likely a different issue:
> > > > > >
> > > > > > I've just had a case where a long-running (7 hours) backfill task
> > > ended
> > > > up
> > > > > > running twice somehow. We're using Celery so this might be
> related
> > to
> > > > > some
> > > > > > sort of Celery visibility timeout, but I haven't had a chance to
> be
> > > > able
> > > > > to
> > > > > > dig in to it in detail - it's 5pm on a Friday :D
> > > > > >
> > > > > > Has anyone else noticed anything similar?
> > > > > >
> > > > > > -ash
> > > > > >
> > > > > >
> > > > > > > On 8 Jun 2018, at 01:22, Tao Feng <fe...@gmail.com> wrote:
> > > > > > >
> > > > > > > Thanks everyone for the feedback especially on the background
> for
> > > > > > backfill.
> > > > > > > After reading the discussion, I think it would be safest to
> add a
> > > > flag
> > > > > > for
> > > > > > > auto rerun failed tasks for backfill with default to be false.
> I
> > > have
> > > > > > > updated the pr accordingly.
> > > > > > >
> > > > > > > Thanks a lot,
> > > > > > > -Tao
> > > > > > >
> > > > > > > On Wed, Jun 6, 2018 at 1:47 PM, Mark Whitfield <
> > > > > > mark.whitfield@nytimes.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> I've been doing some work setting up a large, collaborative
> > > Airflow
> > > > > > >> pipeline with a group that makes heavy use of backfills, and
> > have
> > > > been
> > > > > > >> encountering a lot of these issues myself.
> > > > > > >>
> > > > > > >> Other gripes:
> > > > > > >>
> > > > > > >> Backfills do not obey concurrency pool restrictions. We had
> been
> > > > > making
> > > > > > >> heavy use of SubDAGs and using concurrency pools to prevent
> > > > deadlocks
> > > > > > (why
> > > > > > >> does the SubDAG itself even need to occupy a concurrency slot
> if
> > > > none
> > > > > of
> > > > > > >> its constituent tasks are running?), but this quickly became
> > > > untenable
> > > > > > when
> > > > > > >> using backfills and we were forced to mostly abandon SubDAGs.
> > > > > > >>
> > > > > > >> Backfills do use DagRuns now, which is a big improvement.
> > However,
> > > > > it's
> > > > > > a
> > > > > > >> common use case for us to add new tasks to a DAG and backfill
> > to a
> > > > > date
> > > > > > >> specific to that task. When we do this, the BackfillJob will
> > pick
> > > up
> > > > > > >> previous backfill DagRuns and re-use them, which is mostly
> nice
> > > > > because
> > > > > > it
> > > > > > >> keeps the Tree view neatly organized in the UI. However, it
> does
> > > not
> > > > > > reset
> > > > > > >> the start time of the DagRun when it does this. Combined with
> a
> > > > > > DAG-level
> > > > > > >> timeout, this means that the backfill job will activate a
> > DagRun,
> > > > but
> > > > > > then
> > > > > > >> the run will immediately time out (since it still thinks it's
> > been
> > > > > > running
> > > > > > >> since the previous backfill). This will cause tasks to
> deadlock
> > > > > > spuriously,
> > > > > > >> making backfills extremely cumbersome to carry out.
> > > > > > >>
> > > > > > >> *Mark Whitfield*
> > > > > > >> Data Scientist
> > > > > > >> New York Times
> > > > > > >>
> > > > > > >>
> > > > > > >> On Wed, Jun 6, 2018 at 3:33 PM Maxime Beauchemin <
> > > > > > >> maximebeauchemin@gmail.com>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> Thanks for the input, this is helpful.
> > > > > > >>>
> > > > > > >>> To add to the list, there's some complexity around
> concurrency
> > > > > > management
> > > > > > >>> and multiple executors:
> > > > > > >>> I just hit this thing where backfill doesn't check DAG-level
> > > > > > concurrency,
> > > > > > >>> fires up 32 tasks, and `airflow run` double-checks DAG-level
> > > > > > concurrency
> > > > > > >>> limit and exits. Right after backfill reschedules right away
> > and
> > > so
> > > > > on,
> > > > > > >>> burning a bunch of CPU doing nothing. In this specific case
> it
> > > > seems
> > > > > > like
> > > > > > >>> `airflow run` should skip that specific check when in the
> > context
> > > > of
> > > > > a
> > > > > > >>> backfill.
> > > > > > >>>
> > > > > > >>> Max
> > > > > > >>>
> > > > > > >>> On Tue, Jun 5, 2018 at 9:23 PM Bolke de Bruin <
> > bdbruin@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >>>
> > > > > > >>>> Thinking out loud here, because it is a while back that I
> did
> > > work
> > > > > on
> > > > > > >>>> backfills. There were some real issues with backfills:
> > > > > > >>>>
> > > > > > >>>> 1. Tasks were running in non deterministic order ending up
> in
> > > > > regular
> > > > > > >>>> deadlocks
> > > > > > >>>> 2. Didn’t create dag runs, making behavior inconsistent. Max
> > > > > > >>>> dag runs could not be enforced, the UI couldn't properly
> > > > > > >>>> display it, plus lots of minor other issues because of it.
> > > > > > >>>> 3. Behavior was different from the scheduler, while
> > > > subdagoperators
> > > > > > >>>> particularly make use of backfills at the moment.
> > > > > > >>>>
> > > > > > >>>> I think with 3 the behavior you are observing crept in. And
> > > given
> > > > 3
> > > > > I
> > > > > > >>>> would argue a consistent behavior between the scheduler and
> > the
> > > > > > >> backfill
> > > > > > >>>> mechanism is still paramount. Thus we should explicitly
> clear
> > > > tasks
> > > > > > >> from
> > > > > > >>>> failed if we want to rerun them. This at least until we move
> > the
> > > > > > >>>> subdagoperator out of backfill and into the scheduler (which
> > is
> > > > > > >> actually
> > > > > > >>>> not too hard). Also we need those command line options
> anyway.
> > > > > > >>>>
> > > > > > >>>> Bolke
> > > > > > >>>>
> > > > > > >>>> Verstuurd vanaf mijn iPad
> > > > > > >>>>
> > > > > > >>>>> Op 6 jun. 2018 om 01:27 heeft Scott Halgrim <
> > > > > > >> scott.halgrim@zapier.com
> > > > > > >>> .INVALID>
> > > > > > >>>> het volgende geschreven:
> > > > > > >>>>>
> > > > > > >>>>> The request was for opposition, but I’d like to weigh in on
> > the
> > > > > side
> > > > > > >> of
> > > > > > >>>> “it’s a better behavior [to have failed tasks re-run when
> > > cleared
> > > > > in a
> > > > > > >>>> backfill"
> > > > > > >>>>>> On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin <
> > > > > > >>>> maximebeauchemin@gmail.com>, wrote:
> > > > > > >>>>>> @Jeremiah Lowin <jl...@gmail.com> & @Bolke de Bruin <
> > > > > > >>> bdbruin@gmail.com>
> > > > > > >>>> I
> > > > > > >>>>>> think you may have some context on why this may have
> changed
> > > at
> > > > > some
> > > > > > >>>> point.
> > > > > > >>>>>> I'm assuming that when DagRun handling was added to the
> > > backfill
> > > > > > >>> logic,
> > > > > > >>>> the
> > > > > > >>>>>> behavior just happened to change to what it is now.
> > > > > > >>>>>>
> > > > > > >>>>>> Any opposition in moving back towards re-running failed
> > tasks
> > > > when
> > > > > > >>>> starting
> > > > > > >>>>>> a backfill? I think it's a better behavior, though it's a
> > > change
> > > > > in
> > > > > > >>>>>> behavior that we should mention in UPDATE.md.
> > > > > > >>>>>>
> > > > > > >>>>>> One of our goals is to make sure that a failed or killed
> > > > backfill
> > > > > > >> can
> > > > > > >>> be
> > > > > > >>>>>> restarted and just seamlessly pick up where it left off.
> > > > > > >>>>>>
> > > > > > >>>>>> Max
> > > > > > >>>>>>
> > > > > > >>>>>>> On Tue, Jun 5, 2018 at 3:25 PM Tao Feng <
> > fengtao04@gmail.com
> > > >
> > > > > > >> wrote:
> > > > > > >>>>>>>
> > > > > > >>>>>>> After discussing with Max, we think it would be great if
> > > > `airflow
> > > > > > >>>> backfill`
> > > > > > >>>>>>> could be able to auto pick up and rerun those failed
> tasks.
> > > > > > >>> Currently,
> > > > > > >>>> it
> > > > > > >>>>>>> will throw exceptions(
> > > > > > >>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>
> > > > > > >>> https://github.com/apache/incubator-airflow/blob/master/airf
> > > > > > >> low/jobs.py#L2489
> > > > > > >>>>>>> )
> > > > > > >>>>>>> without rerunning the failed tasks.
> > > > > > >>>>>>>
> > > > > > >>>>>>> But since it broke some of the previous assumptions for
> > > > backfill,
> > > > > > >> we
> > > > > > >>>> would
> > > > > > >>>>>>> like to get some feedback and see if anyone has any
> > > concerns(pr
> > > > > > >> could
> > > > > > >>>> be
> > > > > > >>>>>>> found at https://github.com/apache/incu
> > > > > > >> bator-airflow/pull/3464/files
> > > > > > >>> ).
> > > > > > >>>>>>>
> > > > > > >>>>>>> Thanks,
> > > > > > >>>>>>> -Tao
> > > > > > >>>>>>>

Re: Is `airflow backfill` dysfunctional?

Posted by Maxime Beauchemin <ma...@gmail.com>.
Note that sometimes it can be convenient to run a backfill based on a
previous version or altered DAG. For example, if logic has changed in the
repo but you need to re-run some earlier logic against some period in 2016,
you may want to check out an earlier commit and trigger a backfill based on
that logic.

Another use case: if you're working on a brand new DAG, you may want to run
it on a month or three to plot / validate some data prior to merging to
master, ...

Assuming something like "DagFetcher", it'd be great for the backfill to be
remote, but to allow specifying an alternate DAG artifact.

Max
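The "earlier commit" workflow above can be sketched with plain subprocess
calls. The repo path, commit hash, and dag_id are placeholders for your own
setup, and the flags follow the 1.x `airflow backfill` CLI:

```python
import subprocess

# Hedged sketch of the workflow described above: check out the DAG repo at an
# older commit, then run a local backfill against that version of the logic.
# repo_path / commit / dag_id are placeholders, not real values.
def backfill_at_commit(repo_path, commit, dag_id, start_date, end_date):
    subprocess.run(["git", "-C", repo_path, "checkout", commit], check=True)
    subprocess.run(
        ["airflow", "backfill", "--local",
         "-s", start_date, "-e", end_date, dag_id],
        check=True,
    )
```

Remember to check the repo back out to master afterward so the scheduler
picks up the current logic again.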

> > > > > > >>
> > > > > > >> *Mark Whitfield*
> > > > > > >> Data Scientist
> > > > > > >> New York Times
> > > > > > >>
> > > > > > >>
> > > > > > >> On Wed, Jun 6, 2018 at 3:33 PM Maxime Beauchemin <
> > > > > > >> maximebeauchemin@gmail.com>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> Thanks for the input, this is helpful.
> > > > > > >>>
> > > > > > >>> To add to the list, there's some complexity around
> concurrency
> > > > > > management
> > > > > > >>> and multiple executors:
> > > > > > >>> I just hit this thing where backfill doesn't check DAG-level
> > > > > > concurrency,
> > > > > > >>> fires up 32 tasks, and `airflow run` double-checks DAG-level
> > > > > > concurrency
> > > > > > >>> limit and exits. Right after, backfill reschedules right away
> > and
> > > so
> > > > > on,
> > > > > > >>> burning a bunch of CPU doing nothing. In this specific case
> it
> > > > seems
> > > > > > like
> > > > > > >>> `airflow run` should skip that specific check when in the
> > context
> > > > of
> > > > > a
> > > > > > >>> backfill.
> > > > > > >>>
> > > > > > >>> Max
> > > > > > >>>
> > > > > > >>> On Tue, Jun 5, 2018 at 9:23 PM Bolke de Bruin <
> > bdbruin@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >>>
> > > > > > >>>> Thinking out loud here, because it is a while back that I
> did
> > > work
> > > > > on
> > > > > > >>>> backfills. There were some real issues with backfills:
> > > > > > >>>>
> > > > > > >>>> 1. Tasks were running in non-deterministic order ending up
> in
> > > > > regular
> > > > > > >>>> deadlocks
> > > > > > >>>> 2. Didn’t create dag runs, making behavior inconsistent. Max
> > dag
> > > > > runs
> > > > > > >>>> could not be enforced. The UI couldn't really display it, lots of
> > minor
> > > > > other
> > > > > > >>>> issues because of it.
> > > > > > >>>> 3. Behavior was different from the scheduler, while
> > > > subdagoperators
> > > > > > >>>> particularly make use of backfills at the moment.
> > > > > > >>>>
> > > > > > >>>> I think with 3 the behavior you are observing crept in. And
> > > given
> > > > 3
> > > > > I
> > > > > > >>>> would argue a consistent behavior between the scheduler and
> > the
> > > > > > >> backfill
> > > > > > >>>> mechanism is still paramount. Thus we should explicitly
> clear
> > > > tasks
> > > > > > >> from
> > > > > > >>>> failed if we want to rerun them. This at least until we move
> > the
> > > > > > >>>> subdagoperator out of backfill and into the scheduler (which
> > is
> > > > > > >> actually
> > > > > > >>>> not too hard). Also we need those command line options
> anyway.
> > > > > > >>>>
> > > > > > >>>> Bolke
> > > > > > >>>>
> > > > > > >>>> Sent from my iPad
> > > > > > >>>>
> > > > > > >>>>> Op 6 jun. 2018 om 01:27 heeft Scott Halgrim <
> > > > > > >> scott.halgrim@zapier.com
> > > > > > >>> .INVALID>
> > > > > > >>>> het volgende geschreven:
> > > > > > >>>>>
> > > > > > >>>>> The request was for opposition, but I’d like to weigh in on
> > the
> > > > > side
> > > > > > >> of
> > > > > > >>>> “it’s a better behavior [to have failed tasks re-run when
> > > cleared
> > > > > in a
> > > > > > >>>> backfill"
> > > > > > >>>>>> On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin <
> > > > > > >>>> maximebeauchemin@gmail.com>, wrote:
> > > > > > >>>>>> @Jeremiah Lowin <jl...@gmail.com> & @Bolke de Bruin <
> > > > > > >>> bdbruin@gmail.com>
> > > > > > >>>> I
> > > > > > >>>>>> think you may have some context on why this may have
> changed
> > > at
> > > > > some
> > > > > > >>>> point.
> > > > > > >>>>>> I'm assuming that when DagRun handling was added to the
> > > backfill
> > > > > > >>> logic,
> > > > > > >>>> the
> > > > > > >>>>>> behavior just happened to change to what it is now.
> > > > > > >>>>>>
> > > > > > >>>>>> Any opposition in moving back towards re-running failed
> > tasks
> > > > when
> > > > > > >>>> starting
> > > > > > >>>>>> a backfill? I think it's a better behavior, though it's a
> > > change
> > > > > in
> > > > > > >>>>>> behavior that we should mention in UPDATE.md.
> > > > > > >>>>>>
> > > > > > >>>>>> One of our goals is to make sure that a failed or killed
> > > > backfill
> > > > > > >> can
> > > > > > >>> be
> > > > > > >>>>>> restarted and just seamlessly pick up where it left off.
> > > > > > >>>>>>
> > > > > > >>>>>> Max
> > > > > > >>>>>>
> > > > > > >>>>>>> On Tue, Jun 5, 2018 at 3:25 PM Tao Feng <
> > fengtao04@gmail.com
> > > >
> > > > > > >> wrote:
> > > > > > >>>>>>>
> > > > > > >>>>>>> After discussing with Max, we think it would be great if
> > > > `airflow
> > > > > > >>>> backfill`
> > > > > > >>>>>>> could be able to auto pick up and rerun those failed
> tasks.
> > > > > > >>> Currently,
> > > > > > >>>> it
> > > > > > >>>>>>> will throw exceptions(
> > > > > > >>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>
> > > > > > >>> https://github.com/apache/incubator-airflow/blob/master/airf
> > > > > > >> low/jobs.py#L2489
> > > > > > >>>>>>> )
> > > > > > >>>>>>> without rerunning the failed tasks.
> > > > > > >>>>>>>
> > > > > > >>>>>>> But since it broke some of the previous assumptions for
> > > > backfill,
> > > > > > >> we
> > > > > > >>>> would
> > > > > > >>>>>>> like to get some feedback and see if anyone has any
> > > concerns(pr
> > > > > > >> could
> > > > > > >>>> be
> > > > > > >>>>>>> found at https://github.com/apache/incu
> > > > > > >> bator-airflow/pull/3464/files
> > > > > > >>> ).
> > > > > > >>>>>>>
> > > > > > >>>>>>> Thanks,
> > > > > > >>>>>>> -Tao
> > > > > > >>>>>>>
> > > > > > >>>>>>> On Thu, May 24, 2018 at 10:26 AM, Maxime Beauchemin <
> > > > > > >>>>>>> maximebeauchemin@gmail.com> wrote:
> > > > > > >>>>>>>
> > > > > > >>>>>>>> So I'm running a backfill for what feels like the first
> > time
> > > > in
> > > > > > >>> years
> > > > > > >>>>>>> using
> > > > > > >>>>>>>> a simple `airflow backfill --local` command.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> First I start getting a ton of `logging.info` of each
> > task
> > > > > that
> > > > > > >>>> cannot
> > > > > > >>>>>>> be
> > > > > > >>>>>>>> started just yet at every tick flooding my terminal with
> > the
> > > > > > >> keyword
> > > > > > >>>>>>>> `FAILED` in it, looking like a million of lines like
> this
> > > one:
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> [2018-05-24 14:33:07,852] {models.py:1123} INFO -
> > > Dependencies
> > > > > not
> > > > > > >>> met
> > > > > > >>>>>>> for
> > > > > > >>>>>>>> <TaskInstance: some_dag.some_task_id 2018-01-28 00:00:00
> > > > > > >>> [scheduled]>,
> > > > > > >>>>>>>> dependency 'Trigger Rule' FAILED: Task's trigger rule
> > > > > > >> 'all_success'
> > > > > > >>> re
> > > > > > >>>>>>>> quires all upstream tasks to have succeeded, but found 1
> > > > > > >>>> non-success(es).
> > > > > > >>>>>>>> upstream_tasks_state={'successes': 0L, 'failed': 0L,
> > > > > > >>>> 'upstream_failed':
> > > > > > >>>>>>>> 0L,
> > > > > > >>>>>>>> 'skipped': 0L, 'done': 0L},
> upstream_task_ids=['some_other
> > > > > > >> _task_id']
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Good thing I triggered 1 month and not 2 years like I
> > > actually
> > > > > > >> need,
> > > > > > >>>> just
> > > > > > >>>>>>>> the logs here would be "big data". Now I'm unclear
> whether
> > > > > there's
> > > > > > >>>>>>> anything
> > > > > > >>>>>>>> actually running or if I did something wrong, so I
> decide
> > to
> > > > > kill
> > > > > > >>> the
> > > > > > >>>>>>>> process so I can set a smaller date range and get a
> better
> > > > > picture
> > > > > > >>> of
> > > > > > >>>>>>>> what's up.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> I check my logging level, am I in DEBUG? Nope. Just
> INFO.
> > > So I
> > > > > > >> take
> > > > > > >>> a
> > > > > > >>>>>>> note
> > > > > > >>>>>>>> that I'll need to find that log-flooding line and demote
> > it
> > > to
> > > > > > >> DEBUG
> > > > > > >>>> in a
> > > > > > >>>>>>>> quick PR, no biggy.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Now I restart with just a single schedule, and get an
> > error
> > > > `Dag
> > > > > > >>>>>>> {some_dag}
> > > > > > >>>>>>>> has reached maximum amount of 3 dag runs`. Hmmm, I wish
> > > > backfill
> > > > > > >>> could
> > > > > > >>>>>>> just
> > > > > > >>>>>>>> pickup where it left off. Maybe I need to run an
> `airflow
> > > > clear`
> > > > > > >>>> command
> > > > > > >>>>>>>> and restart? Ok, ran my clear command, same error is
> > showing
> > > > up.
> > > > > > >>> Dead
> > > > > > >>>>>>> end.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Maybe there is some new `airflow clear --reset-dagruns`
> > > > option?
> > > > > > >>>> Doesn't
> > > > > > >>>>>>>> look like it... Maybe `airflow backfill` has some new
> > > switches
> > > > > to
> > > > > > >>>> pick up
> > > > > > >>>>>>>> where it left off? Can't find it. Am I supposed to clear
> > the
> > > > DAG
> > > > > > >>> Runs
> > > > > > >>>>>>>> manually in the UI? This is a pre-production,
> > in-development
> > > > > DAG,
> > > > > > >> so
> > > > > > >>>>>>> it's
> > > > > > >>>>>>>> not on the production web server. Am I supposed to fire
> up
> > > my
> > > > > own
> > > > > > >>> web
> > > > > > >>>>>>>> server to go and manually handle the backfill-related
> DAG
> > > > Runs?
> > > > > > >>>> Connect to
> > > > > > >>>>>>>> my staging MySQL and manually clear some DAG runs?
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> So. Fire up a web server, navigate to my dag_id, delete
> > the
> > > > DAG
> > > > > > >>> runs,
> > > > > > >>>> it
> > > > > > >>>>>>>> appears I can finally start over.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Next thought was: "Alright looks like I need to go Linus
> > on
> > > > the
> > > > > > >>>> mailing
> > > > > > >>>>>>>> list".
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> What am I missing? I'm really hoping these issues
> are specific
> > > to
> > > > > > >> 1.8.2!
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Backfilling is core to Airflow and should work very
> well.
> > I
> > > > want
> > > > > > >> to
> > > > > > >>>>>>> restate
> > > > > > >>>>>>>> some reqs for Airflow backfill:
> > > > > > >>>>>>>> * when failing / interrupted, it should seamlessly be
> able
> > > to
> > > > > > >> pickup
> > > > > > >>>>>>> where
> > > > > > >>>>>>>> it left off
> > > > > > >>>>>>>> * terminal logging at the INFO level should be a clear,
> > > human
> > > > > > >>>> consumable,
> > > > > > >>>>>>>> indicator of progress
> > > > > > >>>>>>>> * backfill-related operations (including restarts)
> should
> > be
> > > > > > >> doable
> > > > > > >>>>>>> through
> > > > > > >>>>>>>> CLI interactions, and not require web server
> interactions
> > as
> > > > the
> > > > > > >>>> typical
> > > > > > >>>>>>>> sandbox (dev environment) shouldn't assume the existence
> > of
> > > a
> > > > > web
> > > > > > >>>> server
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Let's fix this.
> > > > > > >>>>>>>>
> > > > > > >>>>>>>> Max
> > > > > > >>>>>>>>
> > > > > > >>>>>>>
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > >
> > > Chao-Han Tsai
> > >
> >
>

Re: Is `airflow backfill` disfunctional?

Posted by "Driesprong, Fokko" <fo...@driesprong.frl>.
Good points James,

Personally, I never use CLI backfilling, and I recommend that colleagues
avoid it because of the points you mention. I too resort to the poor man's
backfilling: clearing the future and past runs in the UI.

I'd rather get rid of the CLI command and instead make it possible to
submit a backfill job through the REST API. That way it can be part of
the web UI, but you could also write a CLI tool if that is your thing :-)
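As a rough sketch of what that could look like (the endpoint path and JSON
fields below are invented for illustration; no such Airflow API existed at
the time of this thread), a backfill submission reduces to one
DagRun-creation payload per execution date:

```python
import json
from datetime import datetime, timedelta

def backfill_payloads(dag_id, start, end, step=timedelta(days=1)):
    """Yield one JSON payload per execution date; each payload could be
    POSTed to a hypothetical /api/v1/dags/<dag_id>/dagRuns endpoint."""
    current = start
    while current <= end:
        yield json.dumps({
            "dag_id": dag_id,
            "run_id": "backfill__" + current.isoformat(),
            "execution_date": current.isoformat(),
        })
        current += step

payloads = list(backfill_payloads("some_dag",
                                  datetime(2018, 1, 1),
                                  datetime(2018, 1, 3)))
```

A thin CLI wrapper that POSTs these payloads would then share the exact
same entry point as the web UI.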

Cheers, Fokko

On Sat, 13 Apr 2019 at 23:26, Maxime Beauchemin <
maximebeauchemin@gmail.com>:

> +1, backfilling, and related "subdag surgeries" are core to a data
> engineer's job, and great tooling around this is super important. Backfill
> needs more TLC!
>
> Max
>
> On Fri, Apr 12, 2019 at 11:48 PM Chao-Han Tsai <mi...@gmail.com>
> wrote:
>
> > +1 on improving backfill.
> >
> > - The terminal interface was uselessly verbose. It was scrolling fast
> > > enough to be unreadable.
> >
> >
> > I agree that backfill is currently too verbose. It simply logs too many
> > things and is hard to read. Oftentimes, I only care about the number of
> > tasks/dagruns that are in-progress/finished/not started. I had a PR
> > <https://github.com/apache/airflow/pull/3478> that implements a progress
> > bar for backfill but was not able to finish. Probably something that can
> > help improve the backfill experience.
> >
> > - The backfill exceeded safe concurrency limits for the cluster and
> > > could've easily brought it down if I'd left it running.
> >
> >
> > Btw. backfill now respects pool limitations, but we should probably look
> > into making it respect the concurrency limit as well.
> >
> > Chao-Han
> >
> > >
> > >
> >
> >
> > On Mon, Mar 4, 2019 at 12:35 PM James Meickle
> > <jm...@quantopian.com.invalid> wrote:
> >
> > > This is an old thread, but I wanted to bump it as I just had a really
> bad
> > > experience using backfill. I'd been hesitant to even try backfills out
> > > given what I've read about it, so I've just relied on the UI to "Clear"
> > > entire tasks. However, I wanted to give it a shot the "right" way.
> > Issues I
> > > ran into:
> > >
> > > - The dry run flag didn't give good feedback about which dagruns and
> task
> > > instances would be affected (and is very easy to typo as "--dry-run")
> > >
> > > - The terminal interface was uselessly verbose. It was scrolling fast
> > > enough to be unreadable.
> > >
> > > - The backfill exceeded safe concurrency limits for the cluster and
> > > could've easily brought it down if I'd left it running.
> > >
> > > - Tasks in the backfill were executed out of order despite the tasks
> > having
> > > `depends_on_past`
> > >
> > > - The backfill converted all existing DAGRuns to be backfill runs that
> > the
> > > scheduler later ignored, which is not how I would've expected this to
> > work
> > > (nor was it indicated in the dry run)
> > >
> > > I ended up having to do manual recovery work in the database to turn
> the
> > > "backfill" runs back into scheduler runs, and then switch to using
> > `airflow
> > > clear`. I'm a heavy Airflow user and this took me an hour; it would've
> > been
> > > much worse for anyone else on my team.
> > >
> > > I don't have any specific suggestions here other than to confirm that
> > this
> > > feature needs an overhaul if it's to be recommended to anyone.
> > >
> > > On Fri, Jun 8, 2018 at 5:38 PM Maxime Beauchemin <
> > > maximebeauchemin@gmail.com>
> > > wrote:
> > >
> > > > Ash I don't see how this could happen unless maybe the node doing the
> > > > backfill is using another metadata database.
> > > >
> > > > In general we recommend for people to run --local backfills and have
> > the
> > > > default/sandbox template for `airflow.cfg` use a LocalExecutor with
> > > > reasonable parallelism to make that behavior the default.
> > > >
> > > > Given the [not-so-great] state of backfill, I'm guessing many have
> been
> > > > using the scheduler to do backfills. In that regard it would be
> nice
> > to
> > > > have CLI commands to generate dagruns or alter the state of existing
> > ones
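A sketch of the kind of CLI surface described above (the command and flag
names are invented for illustration; they are not existing Airflow
commands):

```python
import argparse

def build_parser():
    """Hypothetical CLI for manipulating DagRuns directly; subcommand
    and flag names are made up for illustration."""
    parser = argparse.ArgumentParser(prog="airflow-dagruns")
    sub = parser.add_subparsers(dest="command", required=True)

    # Generate DagRuns for a date range, e.g. ahead of a backfill.
    create = sub.add_parser("create", help="generate DagRuns")
    create.add_argument("dag_id")
    create.add_argument("--start-date", required=True)
    create.add_argument("--end-date", required=True)

    # Alter the state of existing DagRuns from the command line.
    set_state = sub.add_parser("set-state", help="alter existing DagRuns")
    set_state.add_argument("dag_id")
    set_state.add_argument("--state",
                           choices=["running", "success", "failed"])
    return parser

args = build_parser().parse_args(
    ["create", "some_dag", "--start-date", "2018-01-01",
     "--end-date", "2018-01-31"])
```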
> > > >
> > > > Max
> > > >
> > > > On Fri, Jun 8, 2018 at 8:56 AM Ash Berlin-Taylor <
> > > > ash_airflowlist@firemirror.com> wrote:
> > > >
> > > > > Somewhat related to this, but likely a different issue:
> > > > >
> > > > > I've just had a case where a long (7hours) running backfill task
> > ended
> > > up
> > > > > running twice somehow. We're using Celery so this might be related
> to
> > > > some
> > > > > sort of Celery visibility timeout, but I haven't had a chance to be
> > > able
> > > > to
> > > > > dig in to it in detail - it's 5pm on a Friday :D
> > > > >
> > > > > Has anyone else noticed anything similar?
> > > > >
> > > > > -ash
> > > > >
> > > > >
> > > > > > On 8 Jun 2018, at 01:22, Tao Feng <fe...@gmail.com> wrote:
> > > > > >
> > > > > > Thanks everyone for the feedback especially on the background for
> > > > > backfill.
> > > > > > After reading the discussion, I think it would be safest to add a
> > > flag
> > > > > for
> > > > > > auto rerun failed tasks for backfill with default to be false. I
> > have
> > > > > > updated the pr accordingly.
> > > > > >
> > > > > > Thanks a lot,
> > > > > > -Tao
> > > > > >
> > > > > > On Wed, Jun 6, 2018 at 1:47 PM, Mark Whitfield <
> > > > > mark.whitfield@nytimes.com>
> > > > > > wrote:
> > > > > >
> > > > > >> I've been doing some work setting up a large, collaborative
> > Airflow
> > > > > >> pipeline with a group that makes heavy use of backfills, and
> have
> > > been
> > > > > >> encountering a lot of these issues myself.
> > > > > >>
> > > > > >> Other gripes:
> > > > > >>
> > > > > >> Backfills do not obey concurrency pool restrictions. We had been
> > > > making
> > > > > >> heavy use of SubDAGs and using concurrency pools to prevent
> > > deadlocks
> > > > > (why
> > > > > >> does the SubDAG itself even need to occupy a concurrency slot if
> > > none
> > > > of
> > > > > >> its constituent tasks are running?), but this quickly became
> > > untenable
> > > > > when
> > > > > >> using backfills and we were forced to mostly abandon SubDAGs.
> > > > > >>
> > > > > >> Backfills do use DagRuns now, which is a big improvement.
> However,
> > > > it's
> > > > > a
> > > > > >> common use case for us to add new tasks to a DAG and backfill
> to a
> > > > date
> > > > > >> specific to that task. When we do this, the BackfillJob will
> pick
> > up
> > > > > >> previous backfill DagRuns and re-use them, which is mostly nice
> > > > because
> > > > > it
> > > > > >> keeps the Tree view neatly organized in the UI. However, it does
> > not
> > > > > reset
> > > > > >> the start time of the DagRun when it does this. Combined with a
> > > > > DAG-level
> > > > > >> timeout, this means that the backfill job will activate a
> DagRun,
> > > but
> > > > > then
> > > > > >> the run will immediately time out (since it still thinks it's
> been
> > > > > running
> > > > > >> since the previous backfill). This will cause tasks to deadlock
> > > > > spuriously,
> > > > > >> making backfills extremely cumbersome to carry out.
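The failure mode Mark describes can be shown with a toy version of the
timeout check (the function and field names are approximations for
illustration, not Airflow's actual code): a reused backfill DagRun keeps
its old start time, so a DAG-level timeout fires immediately.

```python
from datetime import datetime, timedelta

def is_timed_out(run_start, timeout, now=None):
    """A DAG-level timeout compares 'now' against the run's start time.
    If a reused backfill DagRun keeps last month's start time, this
    returns True the moment the run is activated."""
    now = now or datetime.utcnow()
    return now - run_start > timeout

# A run "started" during last month's backfill, reused today:
stale = is_timed_out(datetime(2018, 5, 1),
                     timeout=timedelta(hours=2),
                     now=datetime(2018, 6, 6))
```

Resetting the DagRun's start time when a backfill reuses it would avoid
the spurious timeout.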
> > > > > >>
> > > > > >> *Mark Whitfield*
> > > > > >> Data Scientist
> > > > > >> New York Times
> > > > > >>
> > > > > >>
> > > > > >> On Wed, Jun 6, 2018 at 3:33 PM Maxime Beauchemin <
> > > > > >> maximebeauchemin@gmail.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >>> Thanks for the input, this is helpful.
> > > > > >>>
> > > > > >>> To add to the list, there's some complexity around concurrency
> > > > > management
> > > > > >>> and multiple executors:
> > > > > >>> I just hit this thing where backfill doesn't check DAG-level
> > > > > concurrency,
> > > > > >>> fires up 32 tasks, and `airflow run` double-checks DAG-level
> > > > > concurrency
> > > > > >>> limit and exits. Right after, backfill reschedules right away
> and
> > so
> > > > on,
> > > > > >>> burning a bunch of CPU doing nothing. In this specific case it
> > > seems
> > > > > like
> > > > > >>> `airflow run` should skip that specific check when in the
> context
> > > of
> > > > a
> > > > > >>> backfill.
> > > > > >>>
> > > > > >>> Max
> > > > > >>>
> > > > > >>> On Tue, Jun 5, 2018 at 9:23 PM Bolke de Bruin <
> bdbruin@gmail.com
> > >
> > > > > wrote:
> > > > > >>>
> > > > > >>>> Thinking out loud here, because it is a while back that I did
> > work
> > > > on
> > > > > >>>> backfills. There were some real issues with backfills:
> > > > > >>>>
> > > > > >>>> 1. Tasks were running in non-deterministic order ending up in
> > > > regular
> > > > > >>>> deadlocks
> > > > > >>>> 2. Didn’t create dag runs, making behavior inconsistent. Max
> dag
> > > > runs
> > > > > >>>> could not be enforced. The UI couldn't really display it, lots of
> minor
> > > > other
> > > > > >>>> issues because of it.
> > > > > >>>> 3. Behavior was different from the scheduler, while
> > > subdagoperators
> > > > > >>>> particularly make use of backfills at the moment.
> > > > > >>>>
> > > > > >>>> I think with 3 the behavior you are observing crept in. And
> > given
> > > 3
> > > > I
> > > > > >>>> would argue a consistent behavior between the scheduler and
> the
> > > > > >> backfill
> > > > > >>>> mechanism is still paramount. Thus we should explicitly clear
> > > tasks
> > > > > >> from
> > > > > >>>> failed if we want to rerun them. This at least until we move
> the
> > > > > >>>> subdagoperator out of backfill and into the scheduler (which
> is
> > > > > >> actually
> > > > > >>>> not too hard). Also we need those command line options anyway.
> > > > > >>>>
> > > > > >>>> Bolke
> > > > > >>>>
> > > > > >>>> Sent from my iPad
> > > > > >>>>
> > > > > >>>>> Op 6 jun. 2018 om 01:27 heeft Scott Halgrim <
> > > > > >> scott.halgrim@zapier.com
> > > > > >>> .INVALID>
> > > > > >>>> het volgende geschreven:
> > > > > >>>>>
> > > > > >>>>> The request was for opposition, but I’d like to weigh in on
> the
> > > > side
> > > > > >> of
> > > > > >>>> “it’s a better behavior [to have failed tasks re-run when
> > cleared
> > > > in a
> > > > > >>>> backfill"
> > > > > >>>>>> On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin <
> > > > > >>>> maximebeauchemin@gmail.com>, wrote:
> > > > > >>>>>> @Jeremiah Lowin <jl...@gmail.com> & @Bolke de Bruin <
> > > > > >>> bdbruin@gmail.com>
> > > > > >>>> I
> > > > > >>>>>> think you may have some context on why this may have changed
> > at
> > > > some
> > > > > >>>> point.
> > > > > >>>>>> I'm assuming that when DagRun handling was added to the
> > backfill
> > > > > >>> logic,
> > > > > >>>> the
> > > > > >>>>>> behavior just happened to change to what it is now.
> > > > > >>>>>>
> > > > > >>>>>> Any opposition in moving back towards re-running failed
> tasks
> > > when
> > > > > >>>> starting
> > > > > >>>>>> a backfill? I think it's a better behavior, though it's a
> > change
> > > > in
> > > > > >>>>>> behavior that we should mention in UPDATE.md.
> > > > > >>>>>>
> > > > > >>>>>> One of our goals is to make sure that a failed or killed
> > > backfill
> > > > > >> can
> > > > > >>> be
> > > > > >>>>>> restarted and just seamlessly pick up where it left off.
> > > > > >>>>>>
> > > > > >>>>>> Max
> > > > > >>>>>>
> > > > > >>>>>>> On Tue, Jun 5, 2018 at 3:25 PM Tao Feng <
> fengtao04@gmail.com
> > >
> > > > > >> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>> After discussing with Max, we think it would be great if
> > > `airflow
> > > > > >>>> backfill`
> > > > > >>>>>>> could be able to auto pick up and rerun those failed tasks.
> > > > > >>> Currently,
> > > > > >>>> it
> > > > > >>>>>>> will throw exceptions(
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>
> > > > > >>> https://github.com/apache/incubator-airflow/blob/master/airf
> > > > > >> low/jobs.py#L2489
> > > > > >>>>>>> )
> > > > > >>>>>>> without rerunning the failed tasks.
> > > > > >>>>>>>
> > > > > >>>>>>> But since it broke some of the previous assumptions for
> > > backfill,
> > > > > >> we
> > > > > >>>> would
> > > > > >>>>>>> like to get some feedback and see if anyone has any
> > concerns(pr
> > > > > >> could
> > > > > >>>> be
> > > > > >>>>>>> found at https://github.com/apache/incu
> > > > > >> bator-airflow/pull/3464/files
> > > > > >>> ).
> > > > > >>>>>>>
> > > > > >>>>>>> Thanks,
> > > > > >>>>>>> -Tao
> > > > > >>>>>>>
> > > > > >>>>>>> On Thu, May 24, 2018 at 10:26 AM, Maxime Beauchemin <
> > > > > >>>>>>> maximebeauchemin@gmail.com> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>>> So I'm running a backfill for what feels like the first
> time
> > > in
> > > > > >>> years
> > > > > >>>>>>> using
> > > > > >>>>>>>> a simple `airflow backfill --local` command.
> > > > > >>>>>>>>
> > > > > >>>>>>>> First I start getting a ton of `logging.info` of each
> task
> > > > that
> > > > > >>>> cannot
> > > > > >>>>>>> be
> > > > > >>>>>>>> started just yet at every tick flooding my terminal with
> the
> > > > > >> keyword
> > > > > >>>>>>>> `FAILED` in it, looking like a million lines like this
> > one:
> > > > > >>>>>>>>
> > > > > >>>>>>>> [2018-05-24 14:33:07,852] {models.py:1123} INFO -
> > Dependencies
> > > > not
> > > > > >>> met
> > > > > >>>>>>> for
> > > > > >>>>>>>> <TaskInstance: some_dag.some_task_id 2018-01-28 00:00:00
> > > > > >>> [scheduled]>,
> > > > > >>>>>>>> dependency 'Trigger Rule' FAILED: Task's trigger rule
> > > > > >> 'all_success'
> > > > > >>> re
> > > > > >>>>>>>> quires all upstream tasks to have succeeded, but found 1
> > > > > >>>> non-success(es).
> > > > > >>>>>>>> upstream_tasks_state={'successes': 0L, 'failed': 0L,
> > > > > >>>> 'upstream_failed':
> > > > > >>>>>>>> 0L,
> > > > > >>>>>>>> 'skipped': 0L, 'done': 0L}, upstream_task_ids=['some_other
> > > > > >> _task_id']
> > > > > >>>>>>>>
> > > > > >>>>>>>> Good thing I triggered 1 month and not 2 years like I
> > actually
> > > > > >> need,
> > > > > >>>> just
> > > > > >>>>>>>> the logs here would be "big data". Now I'm unclear whether
> > > > there's
> > > > > >>>>>>> anything
> > > > > >>>>>>>> actually running or if I did something wrong, so I decide
> to
> > > > kill
> > > > > >>> the
> > > > > >>>>>>>> process so I can set a smaller date range and get a better
> > > > picture
> > > > > >>> of
> > > > > >>>>>>>> what's up.
> > > > > >>>>>>>>
> > > > > >>>>>>>> I check my logging level, am I in DEBUG? Nope. Just INFO.
> > So I
> > > > > >> take
> > > > > >>> a
> > > > > >>>>>>> note
> > > > > >>>>>>>> that I'll need to find that log-flooding line and demote
> it
> > to
> > > > > >> DEBUG
> > > > > >>>> in a
> > > > > >>>>>>>> quick PR, no biggy.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Now I restart with just a single schedule, and get an
> error
> > > `Dag
> > > > > >>>>>>> {some_dag}
> > > > > >>>>>>>> has reached maximum amount of 3 dag runs`. Hmmm, I wish
> > > backfill
> > > > > >>> could
> > > > > >>>>>>> just
> > > > > >>>>>>>> pickup where it left off. Maybe I need to run an `airflow
> > > clear`
> > > > > >>>> command
> > > > > >>>>>>>> and restart? Ok, ran my clear command, same error is
> showing
> > > up.
> > > > > >>> Dead
> > > > > >>>>>>> end.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Maybe there is some new `airflow clear --reset-dagruns`
> > > option?
> > > > > >>>> Doesn't
> > > > > >>>>>>>> look like it... Maybe `airflow backfill` has some new
> > switches
> > > > to
> > > > > >>>> pick up
> > > > > >>>>>>>> where it left off? Can't find it. Am I supposed to clear
> the
> > > DAG
> > > > > >>> Runs
> > > > > >>>>>>>> manually in the UI? This is a pre-production,
> in-development
> > > > DAG,
> > > > > >> so
> > > > > >>>>>>> it's
> > > > > >>>>>>>> not on the production web server. Am I supposed to fire up
> > my
> > > > own
> > > > > >>> web
> > > > > >>>>>>>> server to go and manually handle the backfill-related DAG
> > > Runs?
> > > > >>>> Connect to
> > > > >>>>>>>> my staging MySQL and manually clear some DAG runs?
> > > > > >>>>>>>>
> > > > > >>>>>>>> So. Fire up a web server, navigate to my dag_id, delete
> the
> > > DAG
> > > > > >>> runs,
> > > > > >>>> it
> > > > > >>>>>>>> appears I can finally start over.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Next thought was: "Alright looks like I need to go Linus
> on
> > > the
> > > > > >>>> mailing
> > > > > >>>>>>>> list".
> > > > > >>>>>>>>
> > > > > >>>>>>>> What am I missing? I'm really hoping these issues are specific
> > to
> > > > > >> 1.8.2!
> > > > > >>>>>>>>
> > > > > >>>>>>>> Backfilling is core to Airflow and should work very well.
> I
> > > want
> > > > > >> to
> > > > > >>>>>>> restate
> > > > > >>>>>>>> some reqs for Airflow backfill:
> > > > > >>>>>>>> * when failing / interrupted, it should seamlessly be able
> > to
> > > > > >> pickup
> > > > > >>>>>>> where
> > > > > >>>>>>>> it left off
> > > > > >>>>>>>> * terminal logging at the INFO level should be a clear,
> > human
> > > > > >>>> consumable,
> > > > > >>>>>>>> indicator of progress
> > > > > >>>>>>>> * backfill-related operations (including restarts) should
> be
> > > > > >> doable
> > > > > >>>>>>> through
> > > > > >>>>>>>> CLI interactions, and not require web server interactions
> as
> > > the
> > > > > >>>> typical
> > > > > >>>>>>>> sandbox (dev environment) shouldn't assume the existence
> of
> > a
> > > > web
> > > > > >>>> server
> > > > > >>>>>>>>
> > > > > >>>>>>>> Let's fix this.
> > > > > >>>>>>>>
> > > > > >>>>>>>> Max
> > > > > >>>>>>>>
> > > > > >>>>>>>
> > > > > >>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> >
> > Chao-Han Tsai
> >
>

Re: Is `airflow backfill` disfunctional?

Posted by Maxime Beauchemin <ma...@gmail.com>.
+1, backfilling, and related "subdag surgeries" are core to a data
engineer's job, and great tooling around this is super important. Backfill
needs more TLC!

Max

On Fri, Apr 12, 2019 at 11:48 PM Chao-Han Tsai <mi...@gmail.com> wrote:

> +1 on improving backfill.
>
> - The terminal interface was uselessly verbose. It was scrolling fast
> > enough to be unreadable.
>
>
> I agree that backfill is currently too verbose. It simply logs too many
> things and is hard to read. Oftentimes, I only care about the number of
> tasks/dagruns that are in-progress/finished/not started. I had a PR
> <https://github.com/apache/airflow/pull/3478> that implements a progress
> bar for backfill but was not able to finish. Probably something that can
> help improve the backfill experience.
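To illustrate the idea (the function and output format below are invented,
not taken from that PR), a single overwriting status line is enough to
replace thousands of "Dependencies not met" messages:

```python
import sys

def report_progress(finished, running, failed, total):
    """Overwrite one terminal status line instead of logging every
    not-yet-runnable task at INFO level. Purely illustrative."""
    line = (f"backfill: {finished}/{total} done, "
            f"{running} running, {failed} failed")
    # '\r' returns the cursor to column 0 so the line is overwritten.
    sys.stdout.write("\r" + line)
    sys.stdout.flush()
    return line

status = report_progress(finished=12, running=4, failed=1, total=60)
```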
>
> - The backfill exceeded safe concurrency limits for the cluster and
> > could've easily brought it down if I'd left it running.
>
>
> Btw. backfill now respects pool limitations, but we should probably look
> into making it respect the concurrency limit as well.
>
> Chao-Han
>
> >
> >
>
>
> On Mon, Mar 4, 2019 at 12:35 PM James Meickle
> <jm...@quantopian.com.invalid> wrote:
>
> > This is an old thread, but I wanted to bump it as I just had a really bad
> > experience using backfill. I'd been hesitant to even try backfills out
> > given what I've read about it, so I've just relied on the UI to "Clear"
> > entire tasks. However, I wanted to give it a shot the "right" way.
> Issues I
> > ran into:
> >
> > - The dry run flag didn't give good feedback about which dagruns and task
> > instances will be affected (and is very easy to typo as "--dry-run")
> >
> > - The terminal interface was uselessly verbose. It was scrolling fast
> > enough to be unreadable.
> >
> > - The backfill exceeded safe concurrency limits for the cluster and
> > could've easily brought it down if I'd left it running.
> >
> > - Tasks in the backfill were executed out of order despite the tasks
> having
> > `depends_on_past`
> >
> > - The backfill converted all existing DAGRuns to be backfill runs that
> the
> > scheduler later ignored, which is not how I would've expected this to
> work
> > (nor was it indicated in the dry run)
> >
> > I ended up having to do manual recovery work in the database to turn the
> > "backfill" runs back into scheduler runs, and then switch to using
> `airflow
> > clear`. I'm a heavy Airflow user and this took me an hour; it would've
> been
> > much worse for anyone else on my team.
> >
> > I don't have any specific suggestions here other than to confirm that
> this
> > feature needs an overhaul if it's to be recommended to anyone.
> >
> > On Fri, Jun 8, 2018 at 5:38 PM Maxime Beauchemin <
> > maximebeauchemin@gmail.com>
> > wrote:
> >
> > > Ash I don't see how this could happen unless maybe the node doing the
> > > backfill is using another metadata database.
> > >
> > > In general we recommend for people to run --local backfills and have
> the
> > > default/sandbox template for `airflow.cfg` use a LocalExecutor with
> > > reasonable parallelism to make that behavior the default.
> > >
> > > Given the [not-so-great] state of backfill, I'm guessing many have been
> > > using the scheduler to do backfills. In that regard it would be nice to
> > > have CLI commands to generate dagruns or alter the state of existing ones.
> > >
> > > Max
> > >
> > > On Fri, Jun 8, 2018 at 8:56 AM Ash Berlin-Taylor <
> > > ash_airflowlist@firemirror.com> wrote:
> > >
> > > > Somewhat related to this, but likely a different issue:
> > > >
> > > > I've just had a case where a long (7hours) running backfill task
> ended
> > up
> > > > running twice somehow. We're using Celery so this might be related to
> > > some
> > > > sort of Celery visibility timeout, but I haven't had a chance to be
> > able
> > > to
> > > > dig in to it in detail - it's 5pm on a Friday :D
> > > >
> > > > Has anyone else noticed anything similar?
> > > >
> > > > -ash
> > > >
> > > >
> > > > > On 8 Jun 2018, at 01:22, Tao Feng <fe...@gmail.com> wrote:
> > > > >
> > > > > Thanks everyone for the feedback especially on the background for
> > > > backfill.
> > > > > After reading the discussion, I think it would be safest to add a
> > flag
> > > > for
> > > > > auto rerun failed tasks for backfill with default to be false. I
> have
> > > > > updated the pr accordingly.
> > > > >
> > > > > Thanks a lot,
> > > > > -Tao
> > > > >
> > > > > On Wed, Jun 6, 2018 at 1:47 PM, Mark Whitfield <
> > > > mark.whitfield@nytimes.com>
> > > > > wrote:
> > > > >
> > > > >> I've been doing some work setting up a large, collaborative
> Airflow
> > > > >> pipeline with a group that makes heavy use of backfills, and have
> > been
> > > > >> encountering a lot of these issues myself.
> > > > >>
> > > > >> Other gripes:
> > > > >>
> > > > >> Backfills do not obey concurrency pool restrictions. We had been
> > > making
> > > > >> heavy use of SubDAGs and using concurrency pools to prevent
> > deadlocks
> > > > (why
> > > > >> does the SubDAG itself even need to occupy a concurrency slot if
> > none
> > > of
> > > > >> its constituent tasks are running?), but this quickly became
> > untenable
> > > > when
> > > > >> using backfills and we were forced to mostly abandon SubDAGs.
> > > > >>
> > > > >> Backfills do use DagRuns now, which is a big improvement. However,
> > > it's
> > > > a
> > > > >> common use case for us to add new tasks to a DAG and backfill to a
> > > date
> > > > >> specific to that task. When we do this, the BackfillJob will pick
> up
> > > > >> previous backfill DagRuns and re-use them, which is mostly nice
> > > because
> > > > it
> > > > >> keeps the Tree view neatly organized in the UI. However, it does
> not
> > > > reset
> > > > >> the start time of the DagRun when it does this. Combined with a
> > > > DAG-level
> > > > >> timeout, this means that the backfill job will activate a DagRun,
> > but
> > > > then
> > > > >> the run will immediately time out (since it still thinks it's been
> > > > running
> > > > >> since the previous backfill). This will cause tasks to deadlock
> > > > spuriously,
> > > > >> making backfills extremely cumbersome to carry out.
> > > > >>
> > > > >> *Mark Whitfield*
> > > > >> Data Scientist
> > > > >> New York Times
> > > > >>
> > > > >>
> > > > >> On Wed, Jun 6, 2018 at 3:33 PM Maxime Beauchemin <
> > > > >> maximebeauchemin@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >>> Thanks for the input, this is helpful.
> > > > >>>
> > > > >>> To add to the list, there's some complexity around concurrency
> > > > management
> > > > >>> and multiple executors:
> > > > >>> I just hit this thing where backfill doesn't check DAG-level
> > > > concurrency,
> > > > > >>> fires up 32 tasks, and `airflow run` double-checks DAG-level
> > > > concurrency
> > > > >>> limit and exits. Right after backfill reschedules right away and
> so
> > > on,
> > > > >>> burning a bunch of CPU doing nothing. In this specific case it
> > seems
> > > > like
> > > > >>> `airflow run` should skip that specific check when in the context
> > of
> > > a
> > > > >>> backfill.
> > > > >>>
> > > > >>> Max
> > > > >>>
> > > > >>> On Tue, Jun 5, 2018 at 9:23 PM Bolke de Bruin <bdbruin@gmail.com
> >
> > > > wrote:
> > > > >>>
> > > > >>>> Thinking out loud here, because it is a while back that I did
> work
> > > on
> > > > >>>> backfills. There were some real issues with backfills:
> > > > >>>>
> > > > >>>> 1. Tasks were running in non deterministic order ending up in
> > > regular
> > > > >>>> deadlocks
> > > > >>>> 2. Didn’t create dag runs, making behavior inconsistent. Max dag
> > > runs
> > > > > >>>> could not be enforced. The UI couldn't really display it, lots of minor
> > > other
> > > > >>>> issues because of it.
> > > > >>>> 3. Behavior was different from the scheduler, while
> > subdagoperators
> > > > >>>> particularly make use of backfills at the moment.
> > > > >>>>
> > > > >>>> I think with 3 the behavior you are observing crept in. And
> given
> > 3
> > > I
> > > > >>>> would argue a consistent behavior between the scheduler and the
> > > > >> backfill
> > > > >>>> mechanism is still paramount. Thus we should explicitly clear
> > tasks
> > > > >> from
> > > > >>>> failed if we want to rerun them. This at least until we move the
> > > > >>>> subdagoperator out of backfill and into the scheduler (which is
> > > > >> actually
> > > > >>>> not too hard). Also we need those command line options anyway.
> > > > >>>>
> > > > >>>> Bolke
> > > > >>>>
> > > > > >>>> Sent from my iPad
> > > > >>>>
> > > > > >>>>> On 6 Jun 2018 at 01:27, Scott Halgrim <
> > > > > >> scott.halgrim@zapier.com
> > > > > >>> .INVALID>
> > > > > >>>> wrote:
> > > > >>>>>
> > > > >>>>> The request was for opposition, but I’d like to weigh in on the
> > > side
> > > > >> of
> > > > >>>> “it’s a better behavior [to have failed tasks re-run when
> cleared
> > > in a
> > > > >>>> backfill"
> > > > >>>>>> On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin <
> > > > >>>> maximebeauchemin@gmail.com>, wrote:
> > > > >>>>>> @Jeremiah Lowin <jl...@gmail.com> & @Bolke de Bruin <
> > > > >>> bdbruin@gmail.com>
> > > > >>>> I
> > > > >>>>>> think you may have some context on why this may have changed
> at
> > > some
> > > > >>>> point.
> > > > >>>>>> I'm assuming that when DagRun handling was added to the
> backfill
> > > > >>> logic,
> > > > >>>> the
> > > > >>>>>> behavior just happened to change to what it is now.
> > > > >>>>>>
> > > > >>>>>> Any opposition in moving back towards re-running failed tasks
> > when
> > > > >>>> starting
> > > > >>>>>> a backfill? I think it's a better behavior, though it's a
> change
> > > in
> > > > >>>>>> behavior that we should mention in UPDATE.md.
> > > > >>>>>>
> > > > >>>>>> One of our goals is to make sure that a failed or killed
> > backfill
> > > > >> can
> > > > >>> be
> > > > >>>>>> restarted and just seamlessly pick up where it left off.
> > > > >>>>>>
> > > > >>>>>> Max
> > > > >>>>>>
> > > > >>>>>>> On Tue, Jun 5, 2018 at 3:25 PM Tao Feng <fengtao04@gmail.com
> >
> > > > >> wrote:
> > > > >>>>>>>
> > > > >>>>>>> After discussing with Max, we think it would be great if
> > `airflow
> > > > >>>> backfill`
> > > > >>>>>>> could be able to auto pick up and rerun those failed tasks.
> > > > >>> Currently,
> > > > >>>> it
> > > > >>>>>>> will throw exceptions(
> > > > >>>>>>>
> > > > >>>>>>>
> > > > >>>>
> > > > > >>> https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L2489
> > > > >>>>>>> )
> > > > >>>>>>> without rerunning the failed tasks.
> > > > >>>>>>>
> > > > >>>>>>> But since it broke some of the previous assumptions for
> > backfill,
> > > > >> we
> > > > >>>> would
> > > > >>>>>>> like to get some feedback and see if anyone has any
> concerns(pr
> > > > >> could
> > > > >>>> be
> > > > > >>>>>>> found at https://github.com/apache/incubator-airflow/pull/3464/files
> > > > >>> ).
> > > > >>>>>>>
> > > > >>>>>>> Thanks,
> > > > >>>>>>> -Tao
> > > > >>>>>>>
> > > > >>>>>>> On Thu, May 24, 2018 at 10:26 AM, Maxime Beauchemin <
> > > > >>>>>>> maximebeauchemin@gmail.com> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> So I'm running a backfill for what feels like the first time
> > in
> > > > >>> years
> > > > >>>>>>> using
> > > > >>>>>>>> a simple `airflow backfill --local` commands.
> > > > >>>>>>>>
> > > > >>>>>>>> First I start getting a ton of `logging.info` of each tasks
> > > that
> > > > >>>> cannot
> > > > >>>>>>> be
> > > > >>>>>>>> started just yet at every tick flooding my terminal with the
> > > > >> keyword
> > > > >>>>>>>> `FAILED` in it, looking like a million of lines like this
> one:
> > > > >>>>>>>>
> > > > >>>>>>>> [2018-05-24 14:33:07,852] {models.py:1123} INFO -
> Dependencies
> > > not
> > > > >>> met
> > > > >>>>>>> for
> > > > >>>>>>>> <TaskInstance: some_dag.some_task_id 2018-01-28 00:00:00
> > > > >>> [scheduled]>,
> > > > >>>>>>>> dependency 'Trigger Rule' FAILED: Task's trigger rule
> > > > >> 'all_success'
> > > > >>> re
> > > > >>>>>>>> quires all upstream tasks to have succeeded, but found 1
> > > > >>>> non-success(es).
> > > > >>>>>>>> upstream_tasks_state={'successes': 0L, 'failed': 0L,
> > > > >>>> 'upstream_failed':
> > > > >>>>>>>> 0L,
> > > > > >>>>>>>> 'skipped': 0L, 'done': 0L}, upstream_task_ids=['some_other_task_id']
> > > > >>>>>>>>
> > > > >>>>>>>> Good thing I triggered 1 month and not 2 years like I
> actually
> > > > >> need,
> > > > >>>> just
> > > > >>>>>>>> the logs here would be "big data". Now I'm unclear whether
> > > there's
> > > > >>>>>>> anything
> > > > >>>>>>>> actually running or if I did something wrong, so I decide to
> > > kill
> > > > >>> the
> > > > >>>>>>>> process so I can set a smaller date range and get a better
> > > picture
> > > > >>> of
> > > > >>>>>>>> what's up.
> > > > >>>>>>>>
> > > > >>>>>>>> I check my logging level, am I in DEBUG? Nope. Just INFO.
> So I
> > > > >> take
> > > > >>> a
> > > > >>>>>>> note
> > > > >>>>>>>> that I'll need to find that log-flooding line and demote it
> to
> > > > >> DEBUG
> > > > >>>> in a
> > > > >>>>>>>> quick PR, no biggy.
> > > > >>>>>>>>
> > > > >>>>>>>> Now I restart with just a single schedule, and get an error
> > `Dag
> > > > >>>>>>> {some_dag}
> > > > >>>>>>>> has reached maximum amount of 3 dag runs`. Hmmm, I wish
> > backfill
> > > > >>> could
> > > > >>>>>>> just
> > > > >>>>>>>> pickup where it left off. Maybe I need to run an `airflow
> > clear`
> > > > >>>> command
> > > > >>>>>>>> and restart? Ok, ran my clear command, same error is showing
> > up.
> > > > >>> Dead
> > > > >>>>>>> end.
> > > > >>>>>>>>
> > > > >>>>>>>> Maybe there is some new `airflow clear --reset-dagruns`
> > option?
> > > > >>>> Doesn't
> > > > >>>>>>>> look like it... Maybe `airflow backfill` has some new
> switches
> > > to
> > > > >>>> pick up
> > > > >>>>>>>> where it left off? Can't find it. Am I supposed to clear the
> > DAG
> > > > >>> Runs
> > > > >>>>>>>> manually in the UI? This is a pre-production, in-development
> > > DAG,
> > > > >> so
> > > > >>>>>>> it's
> > > > >>>>>>>> not on the production web server. Am I supposed to fire up
> my
> > > own
> > > > >>> web
> > > > > >>>>>>>> server to go and manually handle the backfill-related DAG
> > > > > >>>>>>>> Runs? Can't I connect to
> > > > > >>>>>>>> my staging MySQL and manually clear some DAG runs?
> > > > >>>>>>>>
> > > > >>>>>>>> So. Fire up a web server, navigate to my dag_id, delete the
> > DAG
> > > > >>> runs,
> > > > >>>> it
> > > > >>>>>>>> appears I can finally start over.
> > > > >>>>>>>>
> > > > >>>>>>>> Next thought was: "Alright looks like I need to go Linus on
> > the
> > > > >>>> mailing
> > > > >>>>>>>> list".
> > > > >>>>>>>>
> > > > > >>>>>>>> What am I missing? I'm really hoping these issues are specific
> > > > > >>>>>>>> to 1.8.2!
> > > > >>>>>>>>
> > > > >>>>>>>> Backfilling is core to Airflow and should work very well. I
> > want
> > > > >> to
> > > > >>>>>>> restate
> > > > >>>>>>>> some reqs for Airflow backfill:
> > > > >>>>>>>> * when failing / interrupted, it should seamlessly be able
> to
> > > > >> pickup
> > > > >>>>>>> where
> > > > >>>>>>>> it left off
> > > > >>>>>>>> * terminal logging at the INFO level should be a clear,
> human
> > > > >>>> consumable,
> > > > >>>>>>>> indicator of progress
> > > > >>>>>>>> * backfill-related operations (including restarts) should be
> > > > >> doable
> > > > >>>>>>> through
> > > > >>>>>>>> CLI interactions, and not require web server interactions as
> > the
> > > > >>>> typical
> > > > >>>>>>>> sandbox (dev environment) shouldn't assume the existence of
> a
> > > web
> > > > >>>> server
> > > > >>>>>>>>
> > > > >>>>>>>> Let's fix this.
> > > > >>>>>>>>
> > > > >>>>>>>> Max
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > >>>>
> > > > >>>
> > > > >>
> > > >
> > > >
> > >
> >
>
>
> --
>
> Chao-Han Tsai
>

Re: Is `airflow backfill` disfunctional?

Posted by Chao-Han Tsai <mi...@gmail.com>.
+1 on improving backfill.

- The terminal interface was uselessly verbose. It was scrolling fast
> enough to be unreadable.


I agree that backfill is currently too verbose. It simply logs too many
things and is hard to read. Oftentimes, I only care about the number of
tasks/dagruns that are in-progress/finished/not started. I had a PR
<https://github.com/apache/airflow/pull/3478> that implements a progress
bar for backfill but was not able to finish. Probably something that can
help improve the backfill experience.
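To make that concrete, here is a rough sketch (not the actual code from that PR; the state names are illustrative) of the kind of one-line summary a backfill could print per heartbeat instead of per-dependency log spam:

```python
from collections import Counter

def progress_line(task_states):
    """Collapse task-instance states into one human-readable status line."""
    counts = Counter(task_states)  # missing states count as 0
    total = len(task_states)
    done = counts["success"] + counts["failed"] + counts["skipped"]
    return ("backfill: {}/{} done | running={} failed={} not-started={}"
            .format(done, total, counts["running"], counts["failed"],
                    counts["scheduled"] + counts["none"]))

print(progress_line(["success", "success", "running", "scheduled", "none"]))
```

Printed with a carriage return instead of a newline, a line like this would stay put in the terminal and act as a simple progress bar.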

- The backfill exceeded safe concurrency limits for the cluster and
> could've easily brought it down if I'd left it running.


Btw, backfill now respects pool limitations, but we should probably look
into making it respect the concurrency limit.
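The check itself is conceptually tiny; a hypothetical throttle in the backfill loop (names made up, not Airflow's actual API) could look like:

```python
def runnable_now(ready, running, dag_concurrency):
    """Cap how many of the ready task instances may start, given the
    DAG-level concurrency limit and the tasks already running."""
    free_slots = max(0, dag_concurrency - len(running))
    return ready[:free_slots]

# 5 tasks ready, 3 already running, limit of 4: only 1 may start
print(runnable_now(["t1", "t2", "t3", "t4", "t5"], ["a", "b", "c"], 4))
```

The hard part is doing this consistently with what `airflow run` re-checks on the worker side, so backfill and the task runner don't fight each other.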

Chao-Han

>
>


On Mon, Mar 4, 2019 at 12:35 PM James Meickle
<jm...@quantopian.com.invalid> wrote:

> This is an old thread, but I wanted to bump it as I just had a really bad
> experience using backfill. I'd been hesitant to even try backfills out
> given what I've read about it, so I've just relied on the UI to "Clear"
> entire tasks. However, I wanted to give it a shot the "right" way. Issues I
> ran into:
>
> - The dry run flag didn't give good feedback about which dagruns and task
> instances will be affected (and is very easy to typo as "--dry-run")
>
> - The terminal interface was uselessly verbose. It was scrolling fast
> enough to be unreadable.
>
> - The backfill exceeded safe concurrency limits for the cluster and
> could've easily brought it down if I'd left it running.
>
> - Tasks in the backfill were executed out of order despite the tasks having
> `depends_on_past`
>
> - The backfill converted all existing DAGRuns to be backfill runs that the
> scheduler later ignored, which is not how I would've expected this to work
> (nor was it indicated in the dry run)
>
> I ended up having to do manual recovery work in the database to turn the
> "backfill" runs back into scheduler runs, and then switch to using `airflow
> clear`. I'm a heavy Airflow user and this took me an hour; it would've been
> much worse for anyone else on my team.
>
> I don't have any specific suggestions here other than to confirm that this
> feature needs an overhaul if it's to be recommended to anyone.
>
> On Fri, Jun 8, 2018 at 5:38 PM Maxime Beauchemin <
> maximebeauchemin@gmail.com>
> wrote:
>
> > Ash I don't see how this could happen unless maybe the node doing the
> > backfill is using another metadata database.
> >
> > In general we recommend for people to run --local backfills and have the
> > default/sandbox template for `airflow.cfg` use a LocalExecutor with
> > reasonable parallelism to make that behavior the default.
> >
> > Given the [not-so-great] state of backfill, I'm guessing many have been
> > using the scheduler to do backfills. In that regard it would be nice to
> > have CLI commands to generate dagruns or alter the state of existing ones.
> >
> > Max
> >
> > On Fri, Jun 8, 2018 at 8:56 AM Ash Berlin-Taylor <
> > ash_airflowlist@firemirror.com> wrote:
> >
> > > Somewhat related to this, but likely a different issue:
> > >
> > > I've just had a case where a long (7hours) running backfill task ended
> up
> > > running twice somehow. We're using Celery so this might be related to
> > some
> > > sort of Celery visibility timeout, but I haven't had a chance to be
> able
> > to
> > > dig in to it in detail - it's 5pm on a Friday :D
> > >
> > > Has anyone else noticed anything similar?
> > >
> > > -ash
> > >
> > >
> > > > On 8 Jun 2018, at 01:22, Tao Feng <fe...@gmail.com> wrote:
> > > >
> > > > Thanks everyone for the feedback especially on the background for
> > > backfill.
> > > > After reading the discussion, I think it would be safest to add a
> flag
> > > for
> > > > auto rerun failed tasks for backfill with default to be false. I have
> > > > updated the pr accordingly.
> > > >
> > > > Thanks a lot,
> > > > -Tao
> > > >
> > > > On Wed, Jun 6, 2018 at 1:47 PM, Mark Whitfield <
> > > mark.whitfield@nytimes.com>
> > > > wrote:
> > > >
> > > >> I've been doing some work setting up a large, collaborative Airflow
> > > >> pipeline with a group that makes heavy use of backfills, and have
> been
> > > >> encountering a lot of these issues myself.
> > > >>
> > > >> Other gripes:
> > > >>
> > > >> Backfills do not obey concurrency pool restrictions. We had been
> > making
> > > >> heavy use of SubDAGs and using concurrency pools to prevent
> deadlocks
> > > (why
> > > >> does the SubDAG itself even need to occupy a concurrency slot if
> none
> > of
> > > >> its constituent tasks are running?), but this quickly became
> untenable
> > > when
> > > >> using backfills and we were forced to mostly abandon SubDAGs.
> > > >>
> > > >> Backfills do use DagRuns now, which is a big improvement. However,
> > it's
> > > a
> > > >> common use case for us to add new tasks to a DAG and backfill to a
> > date
> > > >> specific to that task. When we do this, the BackfillJob will pick up
> > > >> previous backfill DagRuns and re-use them, which is mostly nice
> > because
> > > it
> > > >> keeps the Tree view neatly organized in the UI. However, it does not
> > > reset
> > > >> the start time of the DagRun when it does this. Combined with a
> > > DAG-level
> > > >> timeout, this means that the backfill job will activate a DagRun,
> but
> > > then
> > > >> the run will immediately time out (since it still thinks it's been
> > > running
> > > >> since the previous backfill). This will cause tasks to deadlock
> > > spuriously,
> > > >> making backfills extremely cumbersome to carry out.
> > > >>
> > > >> *Mark Whitfield*
> > > >> Data Scientist
> > > >> New York Times
> > > >>
> > > >>
> > > >> On Wed, Jun 6, 2018 at 3:33 PM Maxime Beauchemin <
> > > >> maximebeauchemin@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> Thanks for the input, this is helpful.
> > > >>>
> > > >>> To add to the list, there's some complexity around concurrency
> > > management
> > > >>> and multiple executors:
> > > >>> I just hit this thing where backfill doesn't check DAG-level
> > > concurrency,
> > > >>> fires up 32 tasks, and `airflow run` double-checks DAG-level
> > > concurrency
> > > >>> limit and exits. Right after backfill reschedules right away and so
> > on,
> > > >>> burning a bunch of CPU doing nothing. In this specific case it
> seems
> > > like
> > > >>> `airflow run` should skip that specific check when in the context
> of
> > a
> > > >>> backfill.
> > > >>>
> > > >>> Max
> > > >>>
> > > >>> On Tue, Jun 5, 2018 at 9:23 PM Bolke de Bruin <bd...@gmail.com>
> > > wrote:
> > > >>>
> > > >>>> Thinking out loud here, because it is a while back that I did work
> > on
> > > >>>> backfills. There were some real issues with backfills:
> > > >>>>
> > > >>>> 1. Tasks were running in non deterministic order ending up in
> > regular
> > > >>>> deadlocks
> > > >>>> 2. Didn’t create dag runs, making behavior inconsistent. Max dag
> > runs
> > > >>>> could not be enforced. The UI couldn't really display it, lots of minor
> > other
> > > >>>> issues because of it.
> > > >>>> 3. Behavior was different from the scheduler, while
> subdagoperators
> > > >>>> particularly make use of backfills at the moment.
> > > >>>>
> > > >>>> I think with 3 the behavior you are observing crept in. And given
> 3
> > I
> > > >>>> would argue a consistent behavior between the scheduler and the
> > > >> backfill
> > > >>>> mechanism is still paramount. Thus we should explicitly clear
> tasks
> > > >> from
> > > >>>> failed if we want to rerun them. This at least until we move the
> > > >>>> subdagoperator out of backfill and into the scheduler (which is
> > > >> actually
> > > >>>> not too hard). Also we need those command line options anyway.
> > > >>>>
> > > >>>> Bolke
> > > >>>>
> > > >>>> Sent from my iPad
> > > >>>>
> > > >>>>> On 6 Jun 2018 at 01:27, Scott Halgrim <
> > > >> scott.halgrim@zapier.com
> > > >>> .INVALID>
> > > >>>> wrote:
> > > >>>>>
> > > >>>>> The request was for opposition, but I’d like to weigh in on the
> > side
> > > >> of
> > > >>>> “it’s a better behavior [to have failed tasks re-run when cleared
> > in a
> > > >>>> backfill"
> > > >>>>>> On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin <
> > > >>>> maximebeauchemin@gmail.com>, wrote:
> > > >>>>>> @Jeremiah Lowin <jl...@gmail.com> & @Bolke de Bruin <
> > > >>> bdbruin@gmail.com>
> > > >>>> I
> > > >>>>>> think you may have some context on why this may have changed at
> > some
> > > >>>> point.
> > > >>>>>> I'm assuming that when DagRun handling was added to the backfill
> > > >>> logic,
> > > >>>> the
> > > >>>>>> behavior just happened to change to what it is now.
> > > >>>>>>
> > > >>>>>> Any opposition in moving back towards re-running failed tasks
> when
> > > >>>> starting
> > > >>>>>> a backfill? I think it's a better behavior, though it's a change
> > in
> > > >>>>>> behavior that we should mention in UPDATE.md.
> > > >>>>>>
> > > >>>>>> One of our goals is to make sure that a failed or killed
> backfill
> > > >> can
> > > >>> be
> > > >>>>>> restarted and just seamlessly pick up where it left off.
> > > >>>>>>
> > > >>>>>> Max
> > > >>>>>>
> > > >>>>>>> On Tue, Jun 5, 2018 at 3:25 PM Tao Feng <fe...@gmail.com>
> > > >> wrote:
> > > >>>>>>>
> > > >>>>>>> After discussing with Max, we think it would be great if
> `airflow
> > > >>>> backfill`
> > > >>>>>>> could be able to auto pick up and rerun those failed tasks.
> > > >>> Currently,
> > > >>>> it
> > > >>>>>>> will throw exceptions(
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>
> > > >>> https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L2489
> > > >>>>>>> )
> > > >>>>>>> without rerunning the failed tasks.
> > > >>>>>>>
> > > >>>>>>> But since it broke some of the previous assumptions for
> backfill,
> > > >> we
> > > >>>> would
> > > >>>>>>> like to get some feedback and see if anyone has any concerns(pr
> > > >> could
> > > >>>> be
> > > >>>>>>> found at https://github.com/apache/incubator-airflow/pull/3464/files
> > > >>> ).
> > > >>>>>>>
> > > >>>>>>> Thanks,
> > > >>>>>>> -Tao
> > > >>>>>>>
> > > >>>>>>> On Thu, May 24, 2018 at 10:26 AM, Maxime Beauchemin <
> > > >>>>>>> maximebeauchemin@gmail.com> wrote:
> > > >>>>>>>
> > > >>>>>>>> So I'm running a backfill for what feels like the first time
> in
> > > >>> years
> > > >>>>>>> using
> > > >>>>>>>> a simple `airflow backfill --local` commands.
> > > >>>>>>>>
> > > >>>>>>>> First I start getting a ton of `logging.info` of each tasks
> > that
> > > >>>> cannot
> > > >>>>>>> be
> > > >>>>>>>> started just yet at every tick flooding my terminal with the
> > > >> keyword
> > > >>>>>>>> `FAILED` in it, looking like a million of lines like this one:
> > > >>>>>>>>
> > > >>>>>>>> [2018-05-24 14:33:07,852] {models.py:1123} INFO - Dependencies
> > not
> > > >>> met
> > > >>>>>>> for
> > > >>>>>>>> <TaskInstance: some_dag.some_task_id 2018-01-28 00:00:00
> > > >>> [scheduled]>,
> > > >>>>>>>> dependency 'Trigger Rule' FAILED: Task's trigger rule
> > > >> 'all_success'
> > > >>> re
> > > >>>>>>>> quires all upstream tasks to have succeeded, but found 1
> > > >>>> non-success(es).
> > > >>>>>>>> upstream_tasks_state={'successes': 0L, 'failed': 0L,
> > > >>>> 'upstream_failed':
> > > >>>>>>>> 0L,
> > > >>>>>>>> 'skipped': 0L, 'done': 0L}, upstream_task_ids=['some_other_task_id']
> > > >>>>>>>>
> > > >>>>>>>> Good thing I triggered 1 month and not 2 years like I actually
> > > >> need,
> > > >>>> just
> > > >>>>>>>> the logs here would be "big data". Now I'm unclear whether
> > there's
> > > >>>>>>> anything
> > > >>>>>>>> actually running or if I did something wrong, so I decide to
> > kill
> > > >>> the
> > > >>>>>>>> process so I can set a smaller date range and get a better
> > picture
> > > >>> of
> > > >>>>>>>> what's up.
> > > >>>>>>>>
> > > >>>>>>>> I check my logging level, am I in DEBUG? Nope. Just INFO. So I
> > > >> take
> > > >>> a
> > > >>>>>>> note
> > > >>>>>>>> that I'll need to find that log-flooding line and demote it to
> > > >> DEBUG
> > > >>>> in a
> > > >>>>>>>> quick PR, no biggy.
> > > >>>>>>>>
> > > >>>>>>>> Now I restart with just a single schedule, and get an error
> `Dag
> > > >>>>>>> {some_dag}
> > > >>>>>>>> has reached maximum amount of 3 dag runs`. Hmmm, I wish
> backfill
> > > >>> could
> > > >>>>>>> just
> > > >>>>>>>> pickup where it left off. Maybe I need to run an `airflow
> clear`
> > > >>>> command
> > > >>>>>>>> and restart? Ok, ran my clear command, same error is showing
> up.
> > > >>> Dead
> > > >>>>>>> end.
> > > >>>>>>>>
> > > >>>>>>>> Maybe there is some new `airflow clear --reset-dagruns`
> option?
> > > >>>> Doesn't
> > > >>>>>>>> look like it... Maybe `airflow backfill` has some new switches
> > to
> > > >>>> pick up
> > > >>>>>>>> where it left off? Can't find it. Am I supposed to clear the
> DAG
> > > >>> Runs
> > > >>>>>>>> manually in the UI? This is a pre-production, in-development
> > DAG,
> > > >> so
> > > >>>>>>> it's
> > > >>>>>>>> not on the production web server. Am I supposed to fire up my
> > own
> > > >>> web
> > > >>>>>>>> server to go and manually handle the backfill-related DAG
> > > >>>>>>>> Runs? Can't I connect to
> > > >>>>>>>> my staging MySQL and manually clear some DAG runs?
> > > >>>>>>>>
> > > >>>>>>>> So. Fire up a web server, navigate to my dag_id, delete the
> DAG
> > > >>> runs,
> > > >>>> it
> > > >>>>>>>> appears I can finally start over.
> > > >>>>>>>>
> > > >>>>>>>> Next thought was: "Alright looks like I need to go Linus on
> the
> > > >>>> mailing
> > > >>>>>>>> list".
> > > >>>>>>>>
> > > >>>>>>>> What am I missing? I'm really hoping these issues are specific to
> > > >>>>>>>> 1.8.2!
> > > >>>>>>>>
> > > >>>>>>>> Backfilling is core to Airflow and should work very well. I
> want
> > > >> to
> > > >>>>>>> restate
> > > >>>>>>>> some reqs for Airflow backfill:
> > > >>>>>>>> * when failing / interrupted, it should seamlessly be able to
> > > >> pickup
> > > >>>>>>> where
> > > >>>>>>>> it left off
> > > >>>>>>>> * terminal logging at the INFO level should be a clear, human
> > > >>>> consumable,
> > > >>>>>>>> indicator of progress
> > > >>>>>>>> * backfill-related operations (including restarts) should be
> > > >> doable
> > > >>>>>>> through
> > > >>>>>>>> CLI interactions, and not require web server interactions as
> the
> > > >>>> typical
> > > >>>>>>>> sandbox (dev environment) shouldn't assume the existence of a
> > web
> > > >>>> server
> > > >>>>>>>>
> > > >>>>>>>> Let's fix this.
> > > >>>>>>>>
> > > >>>>>>>> Max
> > > >>>>>>>>
> > > >>>>>>>
> > > >>>>
> > > >>>
> > > >>
> > >
> > >
> >
>


-- 

Chao-Han Tsai

Re: Is `airflow backfill` disfunctional?

Posted by James Meickle <jm...@quantopian.com.INVALID>.
This is an old thread, but I wanted to bump it as I just had a really bad
experience using backfill. I'd been hesitant to even try backfills out
given what I've read about it, so I've just relied on the UI to "Clear"
entire tasks. However, I wanted to give it a shot the "right" way. Issues I
ran into:

- The dry run flag didn't give good feedback about which dagruns and task
instances would be affected (and it is very easy to typo as "--dry-run")

- The terminal interface was uselessly verbose. It was scrolling fast
enough to be unreadable.

- The backfill exceeded safe concurrency limits for the cluster and
could've easily brought it down if I'd left it running.

- Tasks in the backfill were executed out of order despite the tasks having
`depends_on_past`

- The backfill converted all existing DAGRuns to be backfill runs that the
scheduler later ignored, which is not how I would've expected this to work
(nor was it indicated in the dry run)
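On the first bullet, even a minimal plan preview would have helped. A sketch of the kind of dry-run feedback I mean (all names made up, not Airflow's API; a real implementation would query the metadata DB):

```python
def dry_run_report(dag_id, execution_dates, task_ids):
    """Build a preview of the DagRuns and task instances a backfill would touch."""
    lines = ["Backfill plan for {}: {} DagRun(s)".format(dag_id, len(execution_dates))]
    for date in execution_dates:
        lines.append("  {} -> {} task instance(s): {}".format(
            date, len(task_ids), ", ".join(task_ids)))
    return "\n".join(lines)

print(dry_run_report("some_dag", ["2019-03-01", "2019-03-02"], ["t1", "t2"]))
```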

I ended up having to do manual recovery work in the database to turn the
"backfill" runs back into scheduler runs, and then switch to using `airflow
clear`. I'm a heavy Airflow user and this took me an hour; it would've been
much worse for anyone else on my team.
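For anyone facing the same cleanup: in 1.x-era schemas, backfill DagRuns are distinguished mainly by their run_id prefix (roughly `backfill_...` vs `scheduled__...`), so the recovery boils down to rewriting that prefix. A toy sqlite sketch of the idea; treat the exact prefixes and columns as assumptions to verify against your own metadata DB before touching anything:

```python
import sqlite3

# Stand-in for the dag_run table in the metadata DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag_run (run_id TEXT, external_trigger INT)")
conn.executemany(
    "INSERT INTO dag_run VALUES (?, ?)",
    [("backfill_2019-03-01T00:00:00", 0),
     ("scheduled__2019-03-02T00:00:00", 0)],
)

# Rewrite backfill run_ids so the scheduler treats them as its own runs.
# 'backfill_' is 9 characters, so the timestamp starts at position 10.
conn.execute(
    "UPDATE dag_run SET run_id = 'scheduled__' || substr(run_id, 10) "
    "WHERE run_id LIKE 'backfill_%'"
)
print([r[0] for r in conn.execute("SELECT run_id FROM dag_run ORDER BY run_id")])
```

On a real MySQL metadata DB the same UPDATE would target the actual `dag_run` table; back it up first.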

I don't have any specific suggestions here other than to confirm that this
feature needs an overhaul if it's to be recommended to anyone.

On Fri, Jun 8, 2018 at 5:38 PM Maxime Beauchemin <ma...@gmail.com>
wrote:

> Ash I don't see how this could happen unless maybe the node doing the
> backfill is using another metadata database.
>
> In general we recommend for people to run --local backfills and have the
> default/sandbox template for `airflow.cfg` use a LocalExecutor with
> reasonable parallelism to make that behavior the default.
>
> Given the [not-so-great] state of backfill, I'm guessing many have been
> using the scheduler to do backfills. In that regard it would be nice to
> have CLI commands to generate dagruns or alter the state of existing ones.
>
> Max
>
> On Fri, Jun 8, 2018 at 8:56 AM Ash Berlin-Taylor <
> ash_airflowlist@firemirror.com> wrote:
>
> > Somewhat related to this, but likely a different issue:
> >
> > I've just had a case where a long (7hours) running backfill task ended up
> > running twice somehow. We're using Celery so this might be related to
> some
> > sort of Celery visibility timeout, but I haven't had a chance to be able
> to
> > dig in to it in detail - it's 5pm on a Friday :D
> >
> > Has anyone else noticed anything similar?
> >
> > -ash
> >
> >
> > > On 8 Jun 2018, at 01:22, Tao Feng <fe...@gmail.com> wrote:
> > >
> > > Thanks everyone for the feedback especially on the background for
> > backfill.
> > > After reading the discussion, I think it would be safest to add a flag
> > for
> > > auto rerun failed tasks for backfill with default to be false. I have
> > > updated the pr accordingly.
> > >
> > > Thanks a lot,
> > > -Tao
> > >
> > > On Wed, Jun 6, 2018 at 1:47 PM, Mark Whitfield <
> > mark.whitfield@nytimes.com>
> > > wrote:
> > >
> > >> I've been doing some work setting up a large, collaborative Airflow
> > >> pipeline with a group that makes heavy use of backfills, and have been
> > >> encountering a lot of these issues myself.
> > >>
> > >> Other gripes:
> > >>
> > >> Backfills do not obey concurrency pool restrictions. We had been
> making
> > >> heavy use of SubDAGs and using concurrency pools to prevent deadlocks
> > (why
> > >> does the SubDAG itself even need to occupy a concurrency slot if none
> of
> > >> its constituent tasks are running?), but this quickly became untenable
> > when
> > >> using backfills and we were forced to mostly abandon SubDAGs.
> > >>
> > >> Backfills do use DagRuns now, which is a big improvement. However,
> it's
> > a
> > >> common use case for us to add new tasks to a DAG and backfill to a
> date
> > >> specific to that task. When we do this, the BackfillJob will pick up
> > >> previous backfill DagRuns and re-use them, which is mostly nice
> because
> > it
> > >> keeps the Tree view neatly organized in the UI. However, it does not
> > reset
> > >> the start time of the DagRun when it does this. Combined with a
> > DAG-level
> > >> timeout, this means that the backfill job will activate a DagRun, but
> > then
> > >> the run will immediately time out (since it still thinks it's been
> > running
> > >> since the previous backfill). This will cause tasks to deadlock
> > spuriously,
> > >> making backfills extremely cumbersome to carry out.
> > >>
> > >> *Mark Whitfield*
> > >> Data Scientist
> > >> New York Times
> > >>
> > >>
> > >> On Wed, Jun 6, 2018 at 3:33 PM Maxime Beauchemin <
> > >> maximebeauchemin@gmail.com>
> > >> wrote:
> > >>
> > >>> Thanks for the input, this is helpful.
> > >>>
> > >>> To add to the list, there's some complexity around concurrency
> > management
> > >>> and multiple executors:
> > >>> I just hit this thing where backfill doesn't check DAG-level
> > concurrency,
> > >>> fires up 32 tasks, and `airflow run` double-checks DAG-level
> > concurrency
> > >>> limit and exits. Right after backfill reschedules right away and so
> on,
> > >>> burning a bunch of CPU doing nothing. In this specific case it seems
> > like
> > >>> `airflow run` should skip that specific check when in the context of
> a
> > >>> backfill.
> > >>>
> > >>> Max
> > >>>
> > >>> On Tue, Jun 5, 2018 at 9:23 PM Bolke de Bruin <bd...@gmail.com>
> > wrote:
> > >>>
> > >>>> Thinking out loud here, because it is a while back that I did work
> on
> > >>>> backfills. There were some real issues with backfills:
> > >>>>
> > >>>> 1. Tasks were running in non deterministic order ending up in
> regular
> > >>>> deadlocks
> > >>>> 2. Didn’t create dag runs, making behavior inconsistent. Max dag
> runs
> > >>>> could not be enforced. UI couldn't really display it, lots of minor other
> other
> > >>>> issues because of it.
> > >>>> 3. Behavior was different from the scheduler, while subdagoperators
> > >>>> particularly make use of backfills at the moment.
> > >>>>
> > >>>> I think with 3 the behavior you are observing crept in. And given 3
> I
> > >>>> would argue a consistent behavior between the scheduler and the
> > >> backfill
> > >>>> mechanism is still paramount. Thus we should explicitly clear tasks
> > >> from
> > >>>> failed if we want to rerun them. This at least until we move the
> > >>>> subdagoperator out of backfill and into the scheduler (which is
> > >> actually
> > >>>> not too hard). Also we need those command line options anyway.
> > >>>>
> > >>>> Bolke
> > >>>>
> > >>>> Sent from my iPad
> > >>>>
> > >>>>> Op 6 jun. 2018 om 01:27 heeft Scott Halgrim <
> > >> scott.halgrim@zapier.com
> > >>> .INVALID>
> > >>>> het volgende geschreven:
> > >>>>>
> > >>>>> The request was for opposition, but I’d like to weigh in on the
> side
> > >> of
> > >>>> “it’s a better behavior [to have failed tasks re-run when cleared
> in a
> > >>>> backfill"
> > >>>>>> On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin <
> > >>>> maximebeauchemin@gmail.com>, wrote:
> > >>>>>> @Jeremiah Lowin <jl...@gmail.com> & @Bolke de Bruin <
> > >>> bdbruin@gmail.com>
> > >>>> I
> > >>>>>> think you may have some context on why this may have changed at
> some
> > >>>> point.
> > >>>>>> I'm assuming that when DagRun handling was added to the backfill
> > >>> logic,
> > >>>> the
> > >>>>>> behavior just happened to change to what it is now.
> > >>>>>>
> > >>>>>> Any opposition in moving back towards re-running failed tasks when
> > >>>> starting
> > >>>>>> a backfill? I think it's a better behavior, though it's a change
> in
> > >>>>>> behavior that we should mention in UPDATE.md.
> > >>>>>>
> > >>>>>> One of our goals is to make sure that a failed or killed backfill
> > >> can
> > >>> be
> > >>>>>> restarted and just seamlessly pick up where it left off.
> > >>>>>>
> > >>>>>> Max
> > >>>>>>
> > >>>>>>> On Tue, Jun 5, 2018 at 3:25 PM Tao Feng <fe...@gmail.com>
> > >> wrote:
> > >>>>>>>
> > >>>>>>> After discussing with Max, we think it would be great if `airflow
> > >>>> backfill`
> > >>>>>>> could be able to auto pick up and rerun those failed tasks.
> > >>> Currently,
> > >>>> it
> > >>>>>>> will throw exceptions(
> > >>>>>>>
> > >>>>>>>
> > >>>>
> > >>> https://github.com/apache/incubator-airflow/blob/master/airf
> > >> low/jobs.py#L2489
> > >>>>>>> )
> > >>>>>>> without rerunning the failed tasks.
> > >>>>>>>
> > >>>>>>> But since it broke some of the previous assumptions for backfill,
> > >> we
> > >>>> would
> > >>>>>>> like to get some feedback and see if anyone has any concerns(pr
> > >> could
> > >>>> be
> > >>>>>>> found at https://github.com/apache/incu
> > >> bator-airflow/pull/3464/files
> > >>> ).
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> -Tao
> > >>>>>>>
> > >>>>>>> On Thu, May 24, 2018 at 10:26 AM, Maxime Beauchemin <
> > >>>>>>> maximebeauchemin@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>> So I'm running a backfill for what feels like the first time in
> > >>> years
> > >>>>>>> using
> > >>>>>>>> a simple `airflow backfill --local` commands.
> > >>>>>>>>
> > >>>>>>>> First I start getting a ton of `logging.info` of each tasks
> that
> > >>>> cannot
> > >>>>>>> be
> > >>>>>>>> started just yet at every tick flooding my terminal with the
> > >> keyword
> > >>>>>>>> `FAILED` in it, looking like a million of lines like this one:
> > >>>>>>>>
> > >>>>>>>> [2018-05-24 14:33:07,852] {models.py:1123} INFO - Dependencies
> not
> > >>> met
> > >>>>>>> for
> > >>>>>>>> <TaskInstance: some_dag.some_task_id 2018-01-28 00:00:00
> > >>> [scheduled]>,
> > >>>>>>>> dependency 'Trigger Rule' FAILED: Task's trigger rule
> > >> 'all_success'
> > >>> re
> > >>>>>>>> quires all upstream tasks to have succeeded, but found 1
> > >>>> non-success(es).
> > >>>>>>>> upstream_tasks_state={'successes': 0L, 'failed': 0L,
> > >>>> 'upstream_failed':
> > >>>>>>>> 0L,
> > >>>>>>>> 'skipped': 0L, 'done': 0L}, upstream_task_ids=['some_other
> > >> _task_id']
> > >>>>>>>>
> > >>>>>>>> Good thing I triggered 1 month and not 2 years like I actually
> > >> need,
> > >>>> just
> > >>>>>>>> the logs here would be "big data". Now I'm unclear whether
> there's
> > >>>>>>> anything
> > >>>>>>>> actually running or if I did something wrong, so I decide to
> kill
> > >>> the
> > >>>>>>>> process so I can set a smaller date range and get a better
> picture
> > >>> of
> > >>>>>>>> what's up.
> > >>>>>>>>
> > >>>>>>>> I check my logging level, am I in DEBUG? Nope. Just INFO. So I
> > >> take
> > >>> a
> > >>>>>>> note
> > >>>>>>>> that I'll need to find that log-flooding line and demote it to
> > >> DEBUG
> > >>>> in a
> > >>>>>>>> quick PR, no biggy.
> > >>>>>>>>
> > >>>>>>>> Now I restart with just a single schedule, and get an error `Dag
> > >>>>>>> {some_dag}
> > >>>>>>>> has reached maximum amount of 3 dag runs`. Hmmm, I wish backfill
> > >>> could
> > >>>>>>> just
> > >>>>>>>> pickup where it left off. Maybe I need to run an `airflow clear`
> > >>>> command
> > >>>>>>>> and restart? Ok, ran my clear command, same error is showing up.
> > >>> Dead
> > >>>>>>> end.
> > >>>>>>>>
> > >>>>>>>> Maybe there is some new `airflow clear --reset-dagruns` option?
> > >>>> Doesn't
> > >>>>>>>> look like it... Maybe `airflow backfill` has some new switches
> to
> > >>>> pick up
> > >>>>>>>> where it left off? Can't find it. Am I supposed to clear the DAG
> > >>> Runs
> > >>>>>>>> manually in the UI? This is a pre-production, in-development
> DAG,
> > >> so
> > >>>>>>> it's
> > >>>>>>>> not on the production web server. Am I supposed to fire up my
> own
> > >>> web
> > >>>>>>>> server to go and manually handle the backfill-related DAG Runs?
> > >>>> Can't I connect to
> > >>>>>>>> my staging MySQL and manually clear some DAG runs?
> > >>>>>>>>
> > >>>>>>>> So. Fire up a web server, navigate to my dag_id, delete the DAG
> > >>> runs,
> > >>>> it
> > >>>>>>>> appears I can finally start over.
> > >>>>>>>>
> > >>>>>>>> Next thought was: "Alright looks like I need to go Linus on the
> > >>>> mailing
> > >>>>>>>> list".
> > >>>>>>>>
> > >>>>>>>> What am I missing? I'm really hoping these issues are specific to
> > >> 1.8.2!
> > >>>>>>>>
> > >>>>>>>> Backfilling is core to Airflow and should work very well. I want
> > >> to
> > >>>>>>> restate
> > >>>>>>>> some reqs for Airflow backfill:
> > >>>>>>>> * when failing / interrupted, it should seamlessly be able to
> > >> pickup
> > >>>>>>> where
> > >>>>>>>> it left off
> > >>>>>>>> * terminal logging at the INFO level should be a clear, human
> > >>>> consumable,
> > >>>>>>>> indicator of progress
> > >>>>>>>> * backfill-related operations (including restarts) should be
> > >> doable
> > >>>>>>> through
> > >>>>>>>> CLI interactions, and not require web server interactions as the
> > >>>> typical
> > >>>>>>>> sandbox (dev environment) shouldn't assume the existence of a
> web
> > >>>> server
> > >>>>>>>>
> > >>>>>>>> Let's fix this.
> > >>>>>>>>
> > >>>>>>>> Max
> > >>>>>>>>
> > >>>>>>>
> > >>>>
> > >>>
> > >>
> >
> >
>

Re: Is `airflow backfill` disfunctional?

Posted by Tao Feng <fe...@gmail.com>.
Thanks everyone for the feedback especially on the background for backfill.
After reading the discussion, I think it would be safest to add a flag for
auto rerun failed tasks for backfill with default to be false. I have
updated the pr accordingly.

Thanks a lot,
-Tao

On Wed, Jun 6, 2018 at 1:47 PM, Mark Whitfield <ma...@nytimes.com>
wrote:

> I've been doing some work setting up a large, collaborative Airflow
> pipeline with a group that makes heavy use of backfills, and have been
> encountering a lot of these issues myself.
>
> Other gripes:
>
> Backfills do not obey concurrency pool restrictions. We had been making
> heavy use of SubDAGs and using concurrency pools to prevent deadlocks (why
> does the SubDAG itself even need to occupy a concurrency slot if none of
> its constituent tasks are running?), but this quickly became untenable when
> using backfills and we were forced to mostly abandon SubDAGs.
>
> Backfills do use DagRuns now, which is a big improvement. However, it's a
> common use case for us to add new tasks to a DAG and backfill to a date
> specific to that task. When we do this, the BackfillJob will pick up
> previous backfill DagRuns and re-use them, which is mostly nice because it
> keeps the Tree view neatly organized in the UI. However, it does not reset
> the start time of the DagRun when it does this. Combined with a DAG-level
> timeout, this means that the backfill job will activate a DagRun, but then
> the run will immediately time out (since it still thinks it's been running
> since the previous backfill). This will cause tasks to deadlock spuriously,
> making backfills extremely cumbersome to carry out.
>
> *Mark Whitfield*
> Data Scientist
> New York Times
>
>
> On Wed, Jun 6, 2018 at 3:33 PM Maxime Beauchemin <
> maximebeauchemin@gmail.com>
> wrote:
>
> > Thanks for the input, this is helpful.
> >
> > To add to the list, there's some complexity around concurrency management
> > and multiple executors:
> > I just hit this thing where backfill doesn't check DAG-level concurrency,
> > fires up 32 tasks, and `airflow run` double-checks the DAG-level concurrency
> > limit and exits. Right after, backfill reschedules it, and so on,
> > burning a bunch of CPU doing nothing. In this specific case it seems like
> > `airflow run` should skip that specific check when in the context of a
> > backfill.
> >
> > Max
> >
> > On Tue, Jun 5, 2018 at 9:23 PM Bolke de Bruin <bd...@gmail.com> wrote:
> >
> > > Thinking out loud here, because it is a while back that I did work on
> > > backfills. There were some real issues with backfills:
> > >
> > > 1. Tasks were running in non deterministic order ending up in regular
> > > deadlocks
> > > 2. Didn’t create dag runs, making behavior inconsistent. Max dag runs
> > > could not be enforced. The UI couldn't really display it, lots of minor other
> > > issues because of it.
> > > 3. Behavior was different from the scheduler, while subdagoperators
> > > particularly make use of backfills at the moment.
> > >
> > > I think with 3 the behavior you are observing crept in. And given 3 I
> > > would argue a consistent behavior between the scheduler and the
> backfill
> > > mechanism is still paramount. Thus we should explicitly clear tasks
> from
> > > failed if we want to rerun them. This at least until we move the
> > > subdagoperator out of backfill and into the scheduler (which is
> actually
> > > not too hard). Also we need those command line options anyway.
> > >
> > > Bolke
> > >
> > > Sent from my iPad
> > >
> > > > On 6 Jun 2018 at 01:27, Scott Halgrim <
> scott.halgrim@zapier.com
> > .INVALID>
> > > wrote:
> > > >
> > > > The request was for opposition, but I’d like to weigh in on the side
> of
> > > “it’s a better behavior [to have failed tasks re-run when cleared in a
> > > backfill"
> > > >> On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin <
> > > maximebeauchemin@gmail.com>, wrote:
> > > >> @Jeremiah Lowin <jl...@gmail.com> & @Bolke de Bruin <
> > bdbruin@gmail.com>
> > > I
> > > >> think you may have some context on why this may have changed at some
> > > point.
> > > >> I'm assuming that when DagRun handling was added to the backfill
> > logic,
> > > the
> > > >> behavior just happened to change to what it is now.
> > > >>
> > > >> Any opposition in moving back towards re-running failed tasks when
> > > starting
> > > >> a backfill? I think it's a better behavior, though it's a change in
> > > >> behavior that we should mention in UPDATE.md.
> > > >>
> > > >> One of our goals is to make sure that a failed or killed backfill
> can
> > be
> > > >> restarted and just seamlessly pick up where it left off.
> > > >>
> > > >> Max
> > > >>
> > > >>> On Tue, Jun 5, 2018 at 3:25 PM Tao Feng <fe...@gmail.com>
> wrote:
> > > >>>
> > > >>> After discussing with Max, we think it would be great if `airflow
> > > backfill`
> > > >>> could be able to auto pick up and rerun those failed tasks.
> > Currently,
> > > it
> > > >>> will throw exceptions(
> > > >>>
> > > >>>
> > >
> > https://github.com/apache/incubator-airflow/blob/master/airf
> low/jobs.py#L2489
> > > >>> )
> > > >>> without rerunning the failed tasks.
> > > >>>
> > > >>> But since it broke some of the previous assumptions for backfill,
> we
> > > would
> > > >>> like to get some feedback and see if anyone has any concerns (PR
> could
> > > be
> > > >>> found at https://github.com/apache/incu
> bator-airflow/pull/3464/files
> > ).
> > > >>>
> > > >>> Thanks,
> > > >>> -Tao
> > > >>>
> > > >>> On Thu, May 24, 2018 at 10:26 AM, Maxime Beauchemin <
> > > >>> maximebeauchemin@gmail.com> wrote:
> > > >>>
> > > >>>> So I'm running a backfill for what feels like the first time in
> > years
> > > >>> using
> > > >>>> a simple `airflow backfill --local` command.
> > > >>>>
> > > >>>> First I start getting a ton of `logging.info` lines for each task that
> > > cannot
> > > >>> be
> > > >>>> started just yet at every tick flooding my terminal with the
> keyword
> > > >>>> `FAILED` in it, looking like millions of lines like this one:
> > > >>>>
> > > >>>> [2018-05-24 14:33:07,852] {models.py:1123} INFO - Dependencies not
> > met
> > > >>> for
> > > >>>> <TaskInstance: some_dag.some_task_id 2018-01-28 00:00:00
> > [scheduled]>,
> > > >>>> dependency 'Trigger Rule' FAILED: Task's trigger rule
> 'all_success'
> > re
> > > >>>> quires all upstream tasks to have succeeded, but found 1
> > > non-success(es).
> > > >>>> upstream_tasks_state={'successes': 0L, 'failed': 0L,
> > > 'upstream_failed':
> > > >>>> 0L,
> > > >>>> 'skipped': 0L, 'done': 0L}, upstream_task_ids=['some_other
> _task_id']
> > > >>>>
> > > >>>> Good thing I triggered 1 month and not 2 years like I actually
> need,
> > > just
> > > >>>> the logs here would be "big data". Now I'm unclear whether there's
> > > >>> anything
> > > >>>> actually running or if I did something wrong, so I decide to kill
> > the
> > > >>>> process so I can set a smaller date range and get a better picture
> > of
> > > >>>> what's up.
> > > >>>>
> > > >>>> I check my logging level, am I in DEBUG? Nope. Just INFO. So I
> take
> > a
> > > >>> note
> > > >>>> that I'll need to find that log-flooding line and demote it to
> DEBUG
> > > in a
> > > >>>> quick PR, no biggy.
> > > >>>>
> > > >>>> Now I restart with just a single schedule, and get an error `Dag
> > > >>> {some_dag}
> > > >>>> has reached maximum amount of 3 dag runs`. Hmmm, I wish backfill
> > could
> > > >>> just
> > > >>>> pickup where it left off. Maybe I need to run an `airflow clear`
> > > command
> > > >>>> and restart? Ok, ran my clear command, same error is showing up.
> > Dead
> > > >>> end.
> > > >>>>
> > > >>>> Maybe there is some new `airflow clear --reset-dagruns` option?
> > > Doesn't
> > > >>>> look like it... Maybe `airflow backfill` has some new switches to
> > > pick up
> > > >>>> where it left off? Can't find it. Am I supposed to clear the DAG
> > Runs
> > > >>>> manually in the UI? This is a pre-production, in-development DAG,
> so
> > > >>> it's
> > > >>>> not on the production web server. Am I supposed to fire up my own
> > web
> > > >>>> server to go and manually handle the backfill-related DAG Runs?
> > > Connect to
> > > >>>> my staging MySQL and manually clear some DAG runs?
> > > >>>>
> > > >>>> So. Fire up a web server, navigate to my dag_id, delete the DAG
> > runs,
> > > it
> > > >>>> appears I can finally start over.
> > > >>>>
> > > >>>> Next thought was: "Alright looks like I need to go Linus on the
> > > mailing
> > > >>>> list".
> > > >>>>
> > > >>>> What am I missing? I'm really hoping these issues are specific to
> 1.8.2!
> > > >>>>
> > > >>>> Backfilling is core to Airflow and should work very well. I want
> to
> > > >>> restate
> > > >>>> some reqs for Airflow backfill:
> > > >>>> * when failing / interrupted, it should seamlessly be able to
> pickup
> > > >>> where
> > > >>>> it left off
> > > >>>> * terminal logging at the INFO level should be a clear, human
> > > consumable,
> > > >>>> indicator of progress
> > > >>>> * backfill-related operations (including restarts) should be
> doable
> > > >>> through
> > > >>>> CLI interactions, and not require web server interactions as the
> > > typical
> > > >>>> sandbox (dev environment) shouldn't assume the existence of a web
> > > server
> > > >>>>
> > > >>>> Let's fix this.
> > > >>>>
> > > >>>> Max
> > > >>>>
> > > >>>
> > >
> >
>

Re: Is `airflow backfill` disfunctional?

Posted by Mark Whitfield <ma...@nytimes.com>.
I've been doing some work setting up a large, collaborative Airflow
pipeline with a group that makes heavy use of backfills, and have been
encountering a lot of these issues myself.

Other gripes:

Backfills do not obey concurrency pool restrictions. We had been making
heavy use of SubDAGs and using concurrency pools to prevent deadlocks (why
does the SubDAG itself even need to occupy a concurrency slot if none of
its constituent tasks are running?), but this quickly became untenable when
using backfills and we were forced to mostly abandon SubDAGs.
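Mark's parenthetical about the SubDAG itself occupying a slot is the crux of the deadlock. A toy model (assumed slot accounting, not Airflow's pool implementation) makes the starvation mechanism concrete:

```python
# Toy model of the pool deadlock: if each SubDagOperator holds one pool
# slot just to supervise, while its child tasks need slots from the SAME
# pool, then enough concurrently running subdags leave no free slot for
# any child, and nothing can ever finish.

def can_children_run(pool_size, running_subdags, slots_per_child=1):
    """True if at least one child task can still acquire a slot."""
    free = pool_size - running_subdags  # supervisors consume slots too
    return free >= slots_per_child

# Pool of 4 with 4 subdags started: every slot is held by a supervisor.
assert not can_children_run(pool_size=4, running_subdags=4)
# Leave one slot free and children can make (slow) progress.
assert can_children_run(pool_size=4, running_subdags=3)
```

This is why the question "why does the SubDAG need a slot if none of its tasks are running?" matters: if the supervisor did not count against the pool, the deadlock above could not occur.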

Backfills do use DagRuns now, which is a big improvement. However, it's a
common use case for us to add new tasks to a DAG and backfill to a date
specific to that task. When we do this, the BackfillJob will pick up
previous backfill DagRuns and re-use them, which is mostly nice because it
keeps the Tree view neatly organized in the UI. However, it does not reset
the start time of the DagRun when it does this. Combined with a DAG-level
timeout, this means that the backfill job will activate a DagRun, but then
the run will immediately time out (since it still thinks it's been running
since the previous backfill). This will cause tasks to deadlock spuriously,
making backfills extremely cumbersome to carry out.
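The stale start-time interaction Mark describes can be sketched as follows. The names here are illustrative, not Airflow internals; the point is only that a timeout measured against a reused DagRun's original start date fires immediately:

```python
# Sketch of the reused-DagRun timeout bug: the BackfillJob re-activates an
# old DagRun without resetting its start date, so a DAG-level dagrun
# timeout sees the run as having been "running" since the last backfill.
from datetime import datetime, timedelta

def dagrun_timed_out(run_start, now, dagrun_timeout):
    return (now - run_start) > dagrun_timeout

old_start = datetime(2018, 5, 1)      # DagRun from last month's backfill
now = datetime(2018, 6, 6)
timeout = timedelta(hours=6)          # a DAG-level run timeout

# Reused without resetting start date: times out the moment it activates.
assert dagrun_timed_out(old_start, now, timeout)

# If the start date were reset on reuse, the run would get its full budget.
assert not dagrun_timed_out(now, now, timeout)
```

Resetting the DagRun's start date when the backfill reuses it would presumably be enough to avoid the spurious deadlocks.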

*Mark Whitfield*
Data Scientist
New York Times


On Wed, Jun 6, 2018 at 3:33 PM Maxime Beauchemin <ma...@gmail.com>
wrote:

> Thanks for the input, this is helpful.
>
> To add to the list, there's some complexity around concurrency management
> and multiple executors:
> I just hit this thing where backfill doesn't check DAG-level concurrency,
> fires up 32 tasks, and `airflow run` double-checks the DAG-level concurrency
> limit and exits. Right after, backfill reschedules it, and so on,
> burning a bunch of CPU doing nothing. In this specific case it seems like
> `airflow run` should skip that specific check when in the context of a
> backfill.
>
> Max
>
> On Tue, Jun 5, 2018 at 9:23 PM Bolke de Bruin <bd...@gmail.com> wrote:
>
> > Thinking out loud here, because it is a while back that I did work on
> > backfills. There were some real issues with backfills:
> >
> > 1. Tasks were running in non deterministic order ending up in regular
> > deadlocks
> > 2. Didn’t create dag runs, making behavior inconsistent. Max dag runs
> > could not be enforced. The UI couldn't really display it, lots of minor other
> > issues because of it.
> > 3. Behavior was different from the scheduler, while subdagoperators
> > particularly make use of backfills at the moment.
> >
> > I think with 3 the behavior you are observing crept in. And given 3 I
> > would argue a consistent behavior between the scheduler and the backfill
> > mechanism is still paramount. Thus we should explicitly clear tasks from
> > failed if we want to rerun them. This at least until we move the
> > subdagoperator out of backfill and into the scheduler (which is actually
> > not too hard). Also we need those command line options anyway.
> >
> > Bolke
> >
> > Sent from my iPad
> >
> > > On 6 Jun 2018 at 01:27, Scott Halgrim <scott.halgrim@zapier.com
> .INVALID>
> > wrote:
> > >
> > > The request was for opposition, but I’d like to weigh in on the side of
> > “it’s a better behavior [to have failed tasks re-run when cleared in a
> > backfill"
> > >> On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin <
> > maximebeauchemin@gmail.com>, wrote:
> > >> @Jeremiah Lowin <jl...@gmail.com> & @Bolke de Bruin <
> bdbruin@gmail.com>
> > I
> > >> think you may have some context on why this may have changed at some
> > point.
> > >> I'm assuming that when DagRun handling was added to the backfill
> logic,
> > the
> > >> behavior just happened to change to what it is now.
> > >>
> > >> Any opposition in moving back towards re-running failed tasks when
> > starting
> > >> a backfill? I think it's a better behavior, though it's a change in
> > >> behavior that we should mention in UPDATE.md.
> > >>
> > >> One of our goals is to make sure that a failed or killed backfill can
> be
> > >> restarted and just seamlessly pick up where it left off.
> > >>
> > >> Max
> > >>
> > >>> On Tue, Jun 5, 2018 at 3:25 PM Tao Feng <fe...@gmail.com> wrote:
> > >>>
> > >>> After discussing with Max, we think it would be great if `airflow
> > backfill`
> > >>> could be able to auto pick up and rerun those failed tasks.
> Currently,
> > it
> > >>> will throw exceptions(
> > >>>
> > >>>
> >
> https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L2489
> > >>> )
> > >>> without rerunning the failed tasks.
> > >>>
> > >>> But since it broke some of the previous assumptions for backfill, we
> > would
> > >>> like to get some feedback and see if anyone has any concerns (PR could
> > be
> > >>> found at https://github.com/apache/incubator-airflow/pull/3464/files
> ).
> > >>>
> > >>> Thanks,
> > >>> -Tao
> > >>>
> > >>> On Thu, May 24, 2018 at 10:26 AM, Maxime Beauchemin <
> > >>> maximebeauchemin@gmail.com> wrote:
> > >>>
> > >>>> So I'm running a backfill for what feels like the first time in
> years
> > >>> using
> > >>>> a simple `airflow backfill --local` command.
> > >>>>
> > >>>> First I start getting a ton of `logging.info` lines for each task that
> > cannot
> > >>> be
> > >>>> started just yet at every tick flooding my terminal with the keyword
> > >>>> `FAILED` in it, looking like millions of lines like this one:
> > >>>>
> > >>>> [2018-05-24 14:33:07,852] {models.py:1123} INFO - Dependencies not
> met
> > >>> for
> > >>>> <TaskInstance: some_dag.some_task_id 2018-01-28 00:00:00
> [scheduled]>,
> > >>>> dependency 'Trigger Rule' FAILED: Task's trigger rule 'all_success'
> re
> > >>>> quires all upstream tasks to have succeeded, but found 1
> > non-success(es).
> > >>>> upstream_tasks_state={'successes': 0L, 'failed': 0L,
> > 'upstream_failed':
> > >>>> 0L,
> > >>>> 'skipped': 0L, 'done': 0L}, upstream_task_ids=['some_other_task_id']
> > >>>>
> > >>>> Good thing I triggered 1 month and not 2 years like I actually need,
> > just
> > >>>> the logs here would be "big data". Now I'm unclear whether there's
> > >>> anything
> > >>>> actually running or if I did something wrong, so I decide to kill
> the
> > >>>> process so I can set a smaller date range and get a better picture
> of
> > >>>> what's up.
> > >>>>
> > >>>> I check my logging level, am I in DEBUG? Nope. Just INFO. So I take
> a
> > >>> note
> > >>>> that I'll need to find that log-flooding line and demote it to DEBUG
> > in a
> > >>>> quick PR, no biggy.
> > >>>>
> > >>>> Now I restart with just a single schedule, and get an error `Dag
> > >>> {some_dag}
> > >>>> has reached maximum amount of 3 dag runs`. Hmmm, I wish backfill
> could
> > >>> just
> > >>>> pickup where it left off. Maybe I need to run an `airflow clear`
> > command
> > >>>> and restart? Ok, ran my clear command, same error is showing up.
> Dead
> > >>> end.
> > >>>>
> > >>>> Maybe there is some new `airflow clear --reset-dagruns` option?
> > Doesn't
> > >>>> look like it... Maybe `airflow backfill` has some new switches to
> > pick up
> > >>>> where it left off? Can't find it. Am I supposed to clear the DAG
> Runs
> > >>>> manually in the UI? This is a pre-production, in-development DAG, so
> > >>> it's
> > >>>> not on the production web server. Am I supposed to fire up my own
> web
> > >>>> server to go and manually handle the backfill-related DAG Runs?
> > Connect to
> > >>>> my staging MySQL and manually clear some DAG runs?
> > >>>>
> > >>>> So. Fire up a web server, navigate to my dag_id, delete the DAG
> runs,
> > it
> > >>>> appears I can finally start over.
> > >>>>
> > >>>> Next thought was: "Alright looks like I need to go Linus on the
> > mailing
> > >>>> list".
> > >>>>
> > >>>> What am I missing? I'm really hoping these issues are specific to 1.8.2!
> > >>>>
> > >>>> Backfilling is core to Airflow and should work very well. I want to
> > >>> restate
> > >>>> some reqs for Airflow backfill:
> > >>>> * when failing / interrupted, it should seamlessly be able to pickup
> > >>> where
> > >>>> it left off
> > >>>> * terminal logging at the INFO level should be a clear, human
> > consumable,
> > >>>> indicator of progress
> > >>>> * backfill-related operations (including restarts) should be doable
> > >>> through
> > >>>> CLI interactions, and not require web server interactions as the
> > typical
> > >>>> sandbox (dev environment) shouldn't assume the existence of a web
> > server
> > >>>>
> > >>>> Let's fix this.
> > >>>>
> > >>>> Max
> > >>>>
> > >>>
> >
>

Re: Is `airflow backfill` disfunctional?

Posted by Jeremiah Lowin <jl...@apache.org>.
Similarly, it's been a while since I touched the backfill code -- my last
commit was more than 2 years ago apparently!! -- so it's certainly
progressed considerably from my early changes. However, to echo Bolke, the
biggest issue with Backfill was that it behaved differently than the
scheduler, which was especially problematic because it was actually used by
the scheduler to run SubDagOperators. And I agree, the behavior you're
seeing probably was a consequence of introducing DagRuns as a unifying
concept.

So if an "automatically clear failed operators" option is desirable (as in
the PR), then I would suggest making it an option that is False by default.
That way SubDagOperators will continue to operate properly in the
scheduler, but users who want to take advantage of that functionality can
simply enable it for manual backfills.




On Wed, Jun 6, 2018 at 3:33 PM Maxime Beauchemin <ma...@gmail.com>
wrote:

> Thanks for the input, this is helpful.
>
> To add to the list, there's some complexity around concurrency management
> and multiple executors:
> I just hit this thing where backfill doesn't check DAG-level concurrency,
> fires up 32 tasks, and `airflow run` double-checks the DAG-level concurrency
> limit and exits. Right after, backfill reschedules it, and so on,
> burning a bunch of CPU doing nothing. In this specific case it seems like
> `airflow run` should skip that specific check when in the context of a
> backfill.
>
> Max
>
> On Tue, Jun 5, 2018 at 9:23 PM Bolke de Bruin <bd...@gmail.com> wrote:
>
> > Thinking out loud here, because it is a while back that I did work on
> > backfills. There were some real issues with backfills:
> >
> > 1. Tasks were running in non deterministic order ending up in regular
> > deadlocks
> > 2. Didn’t create dag runs, making behavior inconsistent. Max dag runs
> > could not be enforced. The UI couldn't really display it, lots of minor other
> > issues because of it.
> > 3. Behavior was different from the scheduler, while subdagoperators
> > particularly make use of backfills at the moment.
> >
> > I think with 3 the behavior you are observing crept in. And given 3 I
> > would argue a consistent behavior between the scheduler and the backfill
> > mechanism is still paramount. Thus we should explicitly clear tasks from
> > failed if we want to rerun them. This at least until we move the
> > subdagoperator out of backfill and into the scheduler (which is actually
> > not too hard). Also we need those command line options anyway.
> >
> > Bolke
> >
> > Sent from my iPad
> >
> > > On 6 Jun 2018 at 01:27, Scott Halgrim <scott.halgrim@zapier.com
> .INVALID>
> > wrote:
> > >
> > > The request was for opposition, but I’d like to weigh in on the side of
> > “it’s a better behavior [to have failed tasks re-run when cleared in a
> > backfill"
> > >> On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin <
> > maximebeauchemin@gmail.com>, wrote:
> > >> @Jeremiah Lowin <jl...@gmail.com> & @Bolke de Bruin <
> bdbruin@gmail.com>
> > I
> > >> think you may have some context on why this may have changed at some
> > point.
> > >> I'm assuming that when DagRun handling was added to the backfill
> logic,
> > the
> > >> behavior just happened to change to what it is now.
> > >>
> > >> Any opposition in moving back towards re-running failed tasks when
> > starting
> > >> a backfill? I think it's a better behavior, though it's a change in
> > >> behavior that we should mention in UPDATE.md.
> > >>
> > >> One of our goals is to make sure that a failed or killed backfill can
> be
> > >> restarted and just seamlessly pick up where it left off.
> > >>
> > >> Max
> > >>
> > >>> On Tue, Jun 5, 2018 at 3:25 PM Tao Feng <fe...@gmail.com> wrote:
> > >>>
> > >>> After discussing with Max, we think it would be great if `airflow
> > backfill`
> > >>> could be able to auto pick up and rerun those failed tasks.
> Currently,
> > it
> > >>> will throw exceptions(
> > >>>
> > >>>
> >
> https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L2489
> > >>> )
> > >>> without rerunning the failed tasks.
> > >>>
> > >>> But since it broke some of the previous assumptions for backfill, we
> > would
> > >>> like to get some feedback and see if anyone has any concerns (PR could
> > be
> > >>> found at https://github.com/apache/incubator-airflow/pull/3464/files
> ).
> > >>>
> > >>> Thanks,
> > >>> -Tao
> > >>>
> > >>> On Thu, May 24, 2018 at 10:26 AM, Maxime Beauchemin <
> > >>> maximebeauchemin@gmail.com> wrote:
> > >>>
> > >>>> So I'm running a backfill for what feels like the first time in
> years
> > >>> using
> > >>>> a simple `airflow backfill --local` command.
> > >>>>
> > >>>> First I start getting a ton of `logging.info` lines for each task that
> > cannot
> > >>> be
> > >>>> started just yet at every tick flooding my terminal with the keyword
> > >>>> `FAILED` in it, looking like millions of lines like this one:
> > >>>>
> > >>>> [2018-05-24 14:33:07,852] {models.py:1123} INFO - Dependencies not
> met
> > >>> for
> > >>>> <TaskInstance: some_dag.some_task_id 2018-01-28 00:00:00
> [scheduled]>,
> > >>>> dependency 'Trigger Rule' FAILED: Task's trigger rule 'all_success'
> re
> > >>>> quires all upstream tasks to have succeeded, but found 1
> > non-success(es).
> > >>>> upstream_tasks_state={'successes': 0L, 'failed': 0L,
> > 'upstream_failed':
> > >>>> 0L,
> > >>>> 'skipped': 0L, 'done': 0L}, upstream_task_ids=['some_other_task_id']
> > >>>>
> > >>>> Good thing I triggered 1 month and not 2 years like I actually need,
> > just
> > >>>> the logs here would be "big data". Now I'm unclear whether there's
> > >>> anything
> > >>>> actually running or if I did something wrong, so I decide to kill
> the
> > >>>> process so I can set a smaller date range and get a better picture
> of
> > >>>> what's up.
> > >>>>
> > >>>> I check my logging level, am I in DEBUG? Nope. Just INFO. So I take
> a
> > >>> note
> > >>>> that I'll need to find that log-flooding line and demote it to DEBUG
> > in a
> > >>>> quick PR, no biggy.
> > >>>>
> > >>>> Now I restart with just a single schedule, and get an error `Dag
> > >>> {some_dag}
> > >>>> has reached maximum amount of 3 dag runs`. Hmmm, I wish backfill
> could
> > >>> just
> > >>>> pickup where it left off. Maybe I need to run an `airflow clear`
> > command
> > >>>> and restart? Ok, ran my clear command, same error is showing up.
> Dead
> > >>> end.
> > >>>>
> > >>>> Maybe there is some new `airflow clear --reset-dagruns` option?
> > Doesn't
> > >>>> look like it... Maybe `airflow backfill` has some new switches to
> > pick up
> > >>>> where it left off? Can't find it. Am I supposed to clear the DAG
> Runs
> > >>>> manually in the UI? This is a pre-production, in-development DAG, so
> > >>> it's
> > >>>> not on the production web server. Am I supposed to fire up my own
> web
> > >>>> server to go and manually handle the backfill-related DAG Runs?
> > Connect to
> > >>>> my staging MySQL and manually clear some DAG runs?
> > >>>>
> > >>>> So. Fire up a web server, navigate to my dag_id, delete the DAG
> runs,
> > it
> > >>>> appears I can finally start over.
> > >>>>
> > >>>> Next thought was: "Alright looks like I need to go Linus on the
> > mailing
> > >>>> list".
> > >>>>
> > >>>> What am I missing? I'm really hoping these issues are specific to 1.8.2!
> > >>>>
> > >>>> Backfilling is core to Airflow and should work very well. I want to
> > >>> restate
> > >>>> some reqs for Airflow backfill:
> > >>>> * when failing / interrupted, it should seamlessly be able to pickup
> > >>> where
> > >>>> it left off
> > >>>> * terminal logging at the INFO level should be a clear, human
> > consumable,
> > >>>> indicator of progress
> > >>>> * backfill-related operations (including restarts) should be doable
> > >>> through
> > >>>> CLI interactions, and not require web server interactions as the
> > typical
> > >>>> sandbox (dev environment) shouldn't assume the existence of a web
> > server
> > >>>>
> > >>>> Let's fix this.
> > >>>>
> > >>>> Max
> > >>>>
> > >>>
> >
>

Re: Is `airflow backfill` disfunctional?

Posted by Maxime Beauchemin <ma...@gmail.com>.
Thanks for the input, this is helpful.

To add to the list, there's some complexity around concurrency management
and multiple executors:
I just hit this thing where backfill doesn't check DAG-level concurrency,
fires up 32 tasks, and `airflow run` double-checks the DAG-level concurrency
limit and exits. Right after, backfill reschedules it, and so on,
burning a bunch of CPU doing nothing. In this specific case it seems like
`airflow run` should skip that specific check when in the context of a
backfill.
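The busy loop Max hits can be shown schematically. This is an assumed model of the behavior described above, not Airflow source: backfill submits tasks without consulting the DAG-level limit, the `airflow run` side enforces it and exits, and the rejected tasks just come back next tick:

```python
# Schematic reproduction of the reschedule/reject loop: with 32 tasks
# submitted against a DAG concurrency limit of 16, half the `airflow run`
# processes exit immediately and get resubmitted, burning CPU with no
# progress beyond the first 16.

def backfill_tick(pending, running, dag_concurrency):
    """One scheduling tick: return (started, rejected) task lists."""
    started, rejected = [], []
    for task in pending:
        if len(running) + len(started) >= dag_concurrency:
            rejected.append(task)  # `airflow run` re-checks the limit and exits
        else:
            started.append(task)
    return started, rejected

pending = ["t%d" % i for i in range(32)]
started, rejected = backfill_tick(pending, running=[], dag_concurrency=16)
# 16 start; 16 are rejected and will be resubmitted on the next tick.
assert len(started) == 16 and len(rejected) == 16
```

Skipping the redundant check in `airflow run` when launched by a backfill (or having backfill respect the limit before submitting) would break the loop either way.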

Max

On Tue, Jun 5, 2018 at 9:23 PM Bolke de Bruin <bd...@gmail.com> wrote:

> Thinking out loud here, because it is a while back that I did work on
> backfills. There were some real issues with backfills:
>
> 1. Tasks were running in non deterministic order ending up in regular
> deadlocks
> 2. Didn’t create dag runs, making behavior inconsistent. Max dag runs
> could not be enforced. The UI couldn't really display it, lots of minor other
> issues because of it.
> 3. Behavior was different from the scheduler, while subdagoperators
> particularly make use of backfills at the moment.
>
> I think with 3 the behavior you are observing crept in. And given 3 I
> would argue a consistent behavior between the scheduler and the backfill
> mechanism is still paramount. Thus we should explicitly clear tasks from
> failed if we want to rerun them. This at least until we move the
> subdagoperator out of backfill and into the scheduler (which is actually
> not too hard). Also we need those command line options anyway.
>
> Bolke
>
> Sent from my iPad
>
> > On 6 Jun 2018 at 01:27, Scott Halgrim <sc...@zapier.com.INVALID>
> wrote:
> >
> > The request was for opposition, but I’d like to weigh in on the side of
> “it’s a better behavior [to have failed tasks re-run when cleared in a
> backfill"
> >> On Jun 5, 2018, 4:16 PM -0700, Maxime Beauchemin <
> maximebeauchemin@gmail.com>, wrote:
> >> @Jeremiah Lowin <jl...@gmail.com> & @Bolke de Bruin <bd...@gmail.com>
> I
> >> think you may have some context on why this may have changed at some
> point.
> >> I'm assuming that when DagRun handling was added to the backfill logic,
> the
> >> behavior just happened to change to what it is now.
> >>
> >> Any opposition in moving back towards re-running failed tasks when
> starting
> >> a backfill? I think it's a better behavior, though it's a change in
> >> behavior that we should mention in UPDATE.md.
> >>
> >> One of our goals is to make sure that a failed or killed backfill can be
> >> restarted and just seamlessly pick up where it left off.
> >>
> >> Max
> >>
> >>> On Tue, Jun 5, 2018 at 3:25 PM Tao Feng <fe...@gmail.com> wrote:
> >>>
> >>> After discussing with Max, we think it would be great if `airflow
> backfill`
> >>> could be able to auto pick up and rerun those failed tasks. Currently,
> it
> >>> will throw exceptions(
> >>>
> >>>
> https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L2489
> >>> )
> >>> without rerunning the failed tasks.
> >>>
> >>> But since it broke some of the previous assumptions for backfill, we
> would
> >>> like to get some feedback and see if anyone has any concerns (PR could
> be
> >>> found at https://github.com/apache/incubator-airflow/pull/3464/files).
> >>>
> >>> Thanks,
> >>> -Tao
> >>>
> >>> On Thu, May 24, 2018 at 10:26 AM, Maxime Beauchemin <
> >>> maximebeauchemin@gmail.com> wrote:
> >>>
> >>>> So I'm running a backfill for what feels like the first time in years
> >>> using
> >>>> a simple `airflow backfill --local` command.
> >>>>
> >>>> First I start getting a ton of `logging.info` lines for each task that
> cannot
> >>> be
> >>>> started just yet at every tick flooding my terminal with the keyword
> >>>> `FAILED` in it, looking like millions of lines like this one:
> >>>>
> >>>> [2018-05-24 14:33:07,852] {models.py:1123} INFO - Dependencies not met
> >>> for
> >>>> <TaskInstance: some_dag.some_task_id 2018-01-28 00:00:00 [scheduled]>,
> >>>> dependency 'Trigger Rule' FAILED: Task's trigger rule 'all_success' re
> >>>> quires all upstream tasks to have succeeded, but found 1
> non-success(es).
> >>>> upstream_tasks_state={'successes': 0L, 'failed': 0L,
> 'upstream_failed':
> >>>> 0L,
> >>>> 'skipped': 0L, 'done': 0L}, upstream_task_ids=['some_other_task_id']
> >>>>
> >>>> Good thing I triggered 1 month and not 2 years like I actually need,
> just
> >>>> the logs here would be "big data". Now I'm unclear whether there's
> >>> anything
> >>>> actually running or if I did something wrong, so I decide to kill the
> >>>> process so I can set a smaller date range and get a better picture of
> >>>> what's up.
> >>>>
> >>>> I check my logging level, am I in DEBUG? Nope. Just INFO. So I take a
> >>> note
> >>>> that I'll need to find that log-flooding line and demote it to DEBUG
> in a
> >>>> quick PR, no biggy.
> >>>>
> >>>> Now I restart with just a single schedule, and get an error `Dag
> >>> {some_dag}
> >>>> has reached maximum amount of 3 dag runs`. Hmmm, I wish backfill could
> >>> just
> >>>> pickup where it left off. Maybe I need to run an `airflow clear`
> command
> >>>> and restart? Ok, ran my clear command, same error is showing up. Dead
> >>> end.
> >>>>
> >>>> Maybe there is some new `airflow clear --reset-dagruns` option?
> Doesn't
> >>>> look like it... Maybe `airflow backfill` has some new switches to
> pick up
> >>>> where it left off? Can't find it. Am I supposed to clear the DAG Runs
> >>>> manually in the UI? This is a pre-production, in-development DAG, so
> >>> it's
> >>>> not on the production web server. Am I supposed to fire up my own web
> >>>> server to go and manually handle the backfill-related DAG Runs?
> Connect to
> >>>> my staging MySQL and manually clear some DAG runs?
> >>>>
> >>>> So. Fire up a web server, navigate to my dag_id, delete the DAG runs,
> it
> >>>> appears I can finally start over.
> >>>>
> >>>> Next thought was: "Alright looks like I need to go Linus on the
> mailing
> >>>> list".
> >>>>
> >>>> What am I missing? I'm really hoping these issues are specific to 1.8.2!
> >>>>
> >>>> Backfilling is core to Airflow and should work very well. I want to
> >>> restate
> >>>> some reqs for Airflow backfill:
> >>>> * when failing / interrupted, it should seamlessly be able to pickup
> >>> where
> >>>> it left off
> >>>> * terminal logging at the INFO level should be a clear, human
> consumable,
> >>>> indicator of progress
> >>>> * backfill-related operations (including restarts) should be doable
> >>> through
> >>>> CLI interactions, and not require web server interactions as the
> typical
> >>>> sandbox (dev environment) shouldn't assume the existence of a web
> server
> >>>>
> >>>> Let's fix this.
> >>>>
> >>>> Max
> >>>>
> >>>
>

Re: Is `airflow backfill` disfunctional?

Posted by Bolke de Bruin <bd...@gmail.com>.
Thinking out loud here, because it has been a while since I worked on backfills. There were some real issues with them:

1. Tasks were running in non-deterministic order, regularly ending up in deadlocks.
2. Backfills didn't create DAG runs, making behavior inconsistent: max DAG runs could not be enforced, the UI couldn't really display them, and there were lots of other minor issues because of it.
3. Behavior was different from the scheduler's, even though SubDagOperators in particular make use of backfills at the moment.

I think the behavior you are observing crept in with fix 3. And given 3, I would argue that consistent behavior between the scheduler and the backfill mechanism is still paramount. Thus we should explicitly clear failed tasks if we want to rerun them, at least until we move the SubDagOperator out of backfill and into the scheduler (which is actually not too hard). Also, we need those command line options anyway.
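The "explicitly clear, then resume" workflow described above can be sketched with the 1.x-era CLI. The flag names below are from that era and the dag_id/dates are placeholders; verify against your installed version with `airflow clear --help`:

```shell
# Reset only the failed task instances in the range (successes are untouched),
# then resume the backfill; it skips instances already marked success.
airflow clear --only_failed --no_confirm \
    --start_date 2018-01-01 --end_date 2018-01-31 some_dag

airflow backfill --local \
    --start_date 2018-01-01 --end_date 2018-01-31 some_dag
```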

Bolke

Sent from my iPad


Re: Is `airflow backfill` disfunctional?

Posted by Scott Halgrim <sc...@zapier.com.INVALID>.
The request was for opposition, but I’d like to weigh in on the side of “it’s a better behavior [to have failed tasks re-run when cleared in a backfill]”.

Re: Is `airflow backfill` disfunctional?

Posted by Maxime Beauchemin <ma...@gmail.com>.
@Jeremiah Lowin <jl...@gmail.com> & @Bolke de Bruin <bd...@gmail.com> I
think you may have some context on why this may have changed at some point.
I'm assuming that when DagRun handling was added to the backfill logic, the
behavior just happened to change to what it is now.

Any opposition to moving back towards re-running failed tasks when starting
a backfill? I think it's better behavior, though it's a change in
behavior that we should mention in UPDATE.md.

One of our goals is to make sure that a failed or killed backfill can be
restarted and just seamlessly pick up where it left off.

Max


Re: Is `airflow backfill` disfunctional?

Posted by Tao Feng <fe...@gmail.com>.
After discussing with Max, we think it would be great if `airflow backfill`
could automatically pick up and rerun failed tasks. Currently, it
throws an exception (
https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L2489)
without rerunning the failed tasks.

But since this breaks some of the previous assumptions for backfill, we would
like to get some feedback and see if anyone has any concerns (the PR can be
found at https://github.com/apache/incubator-airflow/pull/3464/files).

Thanks,
-Tao
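In simplified form, the proposed change looks like the sketch below (plain Python with hypothetical names, no Airflow imports): instead of raising when failed task instances exist in the backfill range, reset them so the backfill re-queues them while leaving finished tasks alone.

```python
class BackfillUnfinished(Exception):
    """Raised by the old behavior when failed tasks block a backfill."""

def prepare_backfill(task_states, rerun_failed_tasks=False):
    """Return the states a backfill would start from.

    task_states maps task-instance key -> state. Old behavior: raise if any
    instance is 'failed'. Proposed behavior: reset failed instances to None
    so the executor re-queues them; successes are left untouched, which is
    the 'pick up where it left off' semantics.
    """
    failed = [ti for ti, state in task_states.items() if state == "failed"]
    if failed and not rerun_failed_tasks:
        raise BackfillUnfinished(
            "Some task instances failed; clear them before rerunning: %s" % failed
        )
    return {
        ti: (None if state == "failed" else state)
        for ti, state in task_states.items()
    }
```

For example, `prepare_backfill({"t1": "success", "t2": "failed"}, rerun_failed_tasks=True)` resets only `t2`, while calling it without the flag reproduces the current exception.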

On Thu, May 24, 2018 at 10:26 AM, Maxime Beauchemin <
maximebeauchemin@gmail.com> wrote:

> So I'm running a backfill for what feels like the first time in years using
> a simple `airflow backfill --local` command.
>
> First I start getting a ton of `logging.info` output at every tick for each
> task that cannot be started just yet, flooding my terminal with the keyword
> `FAILED`, looking like a million lines like this one:
>
> [2018-05-24 14:33:07,852] {models.py:1123} INFO - Dependencies not met for
> <TaskInstance: some_dag.some_task_id 2018-01-28 00:00:00 [scheduled]>,
> dependency 'Trigger Rule' FAILED: Task's trigger rule 'all_success' requires
> all upstream tasks to have succeeded, but found 1 non-success(es).
> upstream_tasks_state={'successes': 0L, 'failed': 0L, 'upstream_failed':
> 0L,
> 'skipped': 0L, 'done': 0L}, upstream_task_ids=['some_other_task_id']
>
> Good thing I triggered 1 month and not the 2 years I actually need; just
> the logs here would be "big data". Now I'm unclear whether there's anything
> actually running or if I did something wrong, so I decide to kill the
> process so I can set a smaller date range and get a better picture of
> what's up.
>
> I check my logging level, am I in DEBUG? Nope. Just INFO. So I take a note
> that I'll need to find that log-flooding line and demote it to DEBUG in a
> quick PR, no biggy.
>
> Now I restart with just a single schedule, and get an error `Dag {some_dag}
> has reached maximum amount of 3 dag runs`. Hmmm, I wish backfill could just
> pickup where it left off. Maybe I need to run an `airflow clear` command
> and restart? Ok, ran my clear command, same error is showing up. Dead end.
>
> Maybe there is some new `airflow clear --reset-dagruns` option? Doesn't
> look like it... Maybe `airflow backfill` has some new switches to pick up
> where it left off? Can't find it. Am I supposed to clear the DAG Runs
> manually in the UI?  This is a pre-production, in-development DAG, so it's
> not on the production web server. Am I supposed to fire up my own web
> server to go and manually handle the backfill-related DAG Runs? Can't I
> connect to my staging MySQL and manually clear some DAG runs?
>
> So. Fire up a web server, navigate to my dag_id, delete the DAG runs, it
> appears I can finally start over.
>
> Next thought was: "Alright looks like I need to go Linus on the mailing
> list".
>
> What am I missing? I'm really hoping these issues are specific to 1.8.2!
>
> Backfilling is core to Airflow and should work very well. I want to restate
> some reqs for Airflow backfill:
> * when failing / interrupted, it should seamlessly be able to pick up where
> it left off
> * terminal logging at the INFO level should be a clear, human-consumable
> indicator of progress
> * backfill-related operations (including restarts) should be doable through
> CLI interactions, and not require web server interactions as the typical
> sandbox (dev environment) shouldn't assume the existence of a web server
>
> Let's fix this.
>
> Max
>