Posted to dev@airflow.apache.org by Scott Halgrim <sc...@zapier.com.INVALID> on 2018/05/25 22:53:43 UTC

Convert Dag Run from Backfill to Scheduled?

I’ve got four months of DAG runs that were originally scheduled DAG runs, and then I backfilled them. Now when I clear a task from one of those, the DAG run goes to “running,” but none of the tasks get scheduled (unless I manually backfill each of them).

What I really should have done here was just clear a mid-DAG task, as well as all downstream tasks, for these DAG runs. But, well, now I’m here, and I’m wondering what the best way to fix this is.

Thanks!


Re: Convert Dag Run from Backfill to Scheduled?

Posted by Maxime Beauchemin <ma...@gmail.com>.
Yes, clearly the DAG runs can be in inconsistent states with related task
instances and backfill processes. Here's a quick patch that helps a little:
https://github.com/apache/incubator-airflow/pull/3433

After writing the quick patch above, I'm thinking it requires a bit more
thought. The clear command is effectively a way to issue a
"scheduler-driven backfill". Maybe we can deprecate clear and have a new
"airflow backfill --scheduler", which would effectively clear task
instances and create/set DAG runs in the right state.

Max


Re: Convert Dag Run from Backfill to Scheduled?

Posted by Ruiqin Yang <yr...@gmail.com>.
This line
<https://github.com/apache/incubator-airflow/blob/master/airflow/jobs.py#L935>
is where the scheduler skips the backfill DAG runs. Regardless of what state
the DAG run is in, tasks in a DAG run whose run ID starts with 'backfill_' are
not considered when scheduling.
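
To make the rule above concrete, here is a minimal sketch of a prefix-based
skip. It is not the code at the linked line: the DagRun class and the
schedulable_runs helper are hypothetical stand-ins, and the prefix is taken
from the run IDs quoted in this thread.

```python
from dataclasses import dataclass

# Assumed prefix, taken from the 'backfill_' run IDs discussed in this thread.
BACKFILL_PREFIX = "backfill_"


@dataclass
class DagRun:
    """Hypothetical stand-in for a DAG run row; not Airflow's model class."""
    run_id: str
    state: str


def schedulable_runs(dag_runs):
    """Keep only runs whose run_id does not start with the backfill prefix.

    The state is deliberately ignored: a 'running' backfill run is still
    skipped, which matches the behavior described above.
    """
    return [run for run in dag_runs if not run.run_id.startswith(BACKFILL_PREFIX)]


runs = [
    DagRun("scheduled__2017-09-01T00:00:00", "running"),
    DagRun("backfill_2017-09-02T00:00:00", "running"),
]
considered = schedulable_runs(runs)
```

Both runs are in the running state, but only the scheduled one survives the
filter, which is why clearing a task in a backfill run has no visible effect.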

I agree with Dan Davydov's idea that we should at least have something like
multiple DAG runs for one execution date, to distinguish different kinds of
DAG runs such as scheduled and backfilled. The situation Scott is facing here
is not the only problem that the lack of multiple DAG runs has caused (e.g.
manually triggering a task in the UI should also create a separate DAG run;
otherwise the implementation logic is a bit weird).

Cheers,
Kevin Y


Re: Convert Dag Run from Backfill to Scheduled?

Posted by Scott Halgrim <sc...@zapier.com.INVALID>.
Well, I’ve gone ahead and run the UPDATE query now, so the scheduler is picking up tasks.

When I cleared the tasks, every DAG run that had a cleared task in it was set to running. Because I’d backfilled them all, they were all `backfill_` DAG runs. Inspecting various tasks via `task_failed_deps` indicated that the tasks had all their dependencies met. After running the update query, they’re all `scheduled__` DAG runs.


Re: Convert Dag Run from Backfill to Scheduled?

Posted by Maxime Beauchemin <ma...@gmail.com>.
While this may work, it's clearly not the prescribed way to do this.
Clearing should just work.

I'm trying to understand why the scheduler is not picking up the cleared
task. Clearing should remove the task instance state and set the state of
the related DAG Run to running so that the scheduler picks those up.
Perhaps there's a conflict between the backfill and scheduler-related DAG
Runs? Which DAG runs are set to running? The backfill or scheduler-related
ones?

Originally, when I introduced DAG runs, backfill operated without any
consideration of DAG runs (DAG runs were a scheduler-specific
construct). Later on, Bolke added backfill-specific DAG runs, and I'm not
100% sure how that works.

Let's get to the bottom of this.

Max


Re: Convert Dag Run from Backfill to Scheduled?

Posted by Ruiqin Yang <yr...@gmail.com>.
If you are sure the update query targets the desired rows, the behavior
should be the same.
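
One way to check that a query targets the desired rows is to run its WHERE
clause as a SELECT first. The sketch below does that against a throwaway
in-memory SQLite table; the three-column table and its sample rows are
made up for illustration and are not the real Airflow metadata schema.

```python
import sqlite3

# Toy stand-in for the dag_run table, with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag_run (dag_id TEXT, run_id TEXT, execution_date TEXT)")
conn.executemany(
    "INSERT INTO dag_run VALUES (?, ?, ?)",
    [
        ("daily", "backfill_2017-09-01T00:00:00", "2017-09-01 00:00:00"),
        ("daily", "scheduled__2018-02-01T00:00:00", "2018-02-01 00:00:00"),
        ("daily", "backfill_2018-03-01T00:00:00", "2018-03-01 00:00:00"),
        ("other", "backfill_2017-10-01T00:00:00", "2017-10-01 00:00:00"),
    ],
)

# Same filter as the UPDATE in this thread, run as a read-only preview.
preview = conn.execute(
    """
    SELECT run_id FROM dag_run
    WHERE dag_id = 'daily'
      AND execution_date > '2017-08-31' AND execution_date < '2018-01-11'
      AND run_id LIKE 'backfill_%'
    ORDER BY execution_date
    """
).fetchall()
```

Note that in a SQL LIKE pattern the underscore is itself a single-character
wildcard, so 'backfill_%' matches any run_id that begins with "backfill"
followed by at least one more character.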


Re: Convert Dag Run from Backfill to Scheduled?

Posted by Scott Halgrim <sc...@zapier.com.INVALID>.
So far no ill effects from:

update dag_run
set run_id = concat('scheduled__', substring(run_id, 10, 19))
where dag_id = 'daily'
and execution_date > '2017-08-31' and execution_date < '2018-01-11'
and run_id like 'backfill_%'
order by execution_date;
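
For readers checking the offsets, the string rewrite done by the SET clause
can be sketched in Python. This assumes run IDs of the form 'backfill_'
followed by a 19-character timestamp, which is what the substring(run_id, 10,
19) call implies; the function name is hypothetical.

```python
def to_scheduled_run_id(run_id: str) -> str:
    """Mirror concat('scheduled__', substring(run_id, 10, 19)) for one run_id.

    SQL substring positions are 1-based, so position 10 with length 19
    corresponds to the Python slice [9:28]. 'backfill_' is 9 characters
    long, so the slice starts right after the prefix.
    """
    if not run_id.startswith("backfill_"):
        # Leave anything that is not a backfill run untouched.
        return run_id
    timestamp = run_id[9:28]
    return "scheduled__" + timestamp


converted = to_scheduled_run_id("backfill_2017-09-01T00:00:00")
untouched = to_scheduled_run_id("scheduled__2017-09-01T00:00:00")
```

If the run IDs in a given database use a different timestamp length, the
fixed length of 19 would truncate or over-read, so it is worth inspecting a
few rows before running the UPDATE.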


Re: Convert Dag Run from Backfill to Scheduled?

Posted by Scott Halgrim <sc...@zapier.com.INVALID>.
Oh wow, that will work? Thanks! Is there any reason for me not to just run a mass UPDATE on those dag runs directly in the metadata database?


Re: Convert Dag Run from Backfill to Scheduled?

Posted by Ruiqin Yang <yr...@gmail.com>.
Airflow is not going to schedule backfill DAG runs; it recognizes them by the
DAG run ID (which starts with 'backfill_'). If you want the scheduler to
schedule those tasks, you can click the DAG run and edit its run ID back to
'scheduled__<something>'.

Cheers,
Kevin Y
