Posted to dev@airflow.apache.org by Maxime Beauchemin <ma...@gmail.com> on 2016/04/28 01:03:29 UTC

Notes about DagRun-Refactor-(Scheduler-2.0)

Notes related to the proposal here:
https://github.com/airbnb/airflow/wiki/DagRun-Refactor-(Scheduler-2.0)

* All of this seems very sound to me. Moving the methods to the right
places will bring a lot of clarity. I can clearly see that I'm no longer
alone in understanding the current challenges and potential solutions! This
is awesome!
* DagRun.run_id's purpose is to allow people to define something meaningful
at the grain of their ETL. Say you wait on genome files landing in a folder
and want a DagRun for each genome file: you can use the unique filename as
the run_id and refer to it in your templates/code. It's more of a way for
people to express and use their own "run id" that is meaningful to them and
carry it through inside Airflow. Airflow's internals would always use
(dag_id, execution_date) as the key, regardless of run_id.
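
A minimal sketch of that genome example (the DAG and script names are made
up; this assumes run_id is exposed in the Jinja template context):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # One DagRun per genome file: the unique filename doubles as the run_id.
    dag = DAG('genome_pipeline',
              start_date=datetime(2016, 1, 1),
              schedule_interval=None)

    process = BashOperator(
        task_id='process_genome',
        # run_id carries the user's meaningful key into the task
        bash_command='process_genome.sh /data/genomes/{{ run_id }}',
        dag=dag)

Each run would then be created with the filename as its run_id (by a sensor
or an external trigger), while Airflow still keys the run internally on
(dag_id, execution_date).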

* What goes in DagRun.lock_id? The job_id of the process managing it? What
if that process needs to be restarted? We could also just have DagRun.type,
where type is either 'backfill' or 'scheduler'. Letting backfill override a
scheduler job may mean that backfill appropriates the DagRuns that are not
in a running state. Lots of complexity and edge cases in this area...
* One constraint around backfill (until we get the git time machine up) is
to allow users to run local code with no handoff to the scheduler, so that
you can check out any version of your DAG in your local repo and run the
DAG as defined locally.
* I'm unclear on whether DagRunJob is sync or async; I think the scheduler
needs it to be async, while backfill overall should be synchronous and log
progress.
* Some of the design might need to change to accommodate the subprocess
handling I just described in the Google group (
https://groups.google.com/forum/#!topic/airbnb_airflow/96hd61T7kgg) that
Paul is working on, but essentially the scheduling needs to take place in a
subprocess and should be async. For backfill that's not a constraint: it
could take place in the main process and can be synchronous.
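
To make the subprocess point concrete, a toy sketch (plain Python, not
Airflow code) of handling each DAG file in a short-lived child process so
the main loop stays async:

    from multiprocessing import Process

    def schedule_dag_file(filepath):
        # parse the file and schedule work for its DagRuns (stubbed out here)
        print('scheduling %s' % filepath)

    if __name__ == '__main__':
        # the main loop fires each file off asynchronously and keeps going;
        # backfill, by contrast, could just run synchronously in-process
        procs = [Process(target=schedule_dag_file, args=(f,))
                 for f in ('dags/a.py', 'dags/b.py')]
        for p in procs:
            p.start()
        for p in procs:
            p.join()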

All of this is fairly brutal and should be broken down into many small PRs
(3? 5?). There are many other large pieces in motion (distributing the
scheduler and parsing the DagBag in subprocesses, the git time machine,
docker/containment, ...). We should land the pieces that help everything
else fall into place, and be very careful about changes that make other
pieces of the puzzle harder to fit in.

Max

Re: Notes about DagRun-Refactor-(Scheduler-2.0)

Posted by Jeremiah Lowin <jl...@gmail.com>.
By the way, one important thing I forgot to mention: my intention was for
this to be almost 100% transparent to users, at least in its initial
version. To that end, all but 2 *existing* unit tests are currently passing
with minimal changes required (a fix is coming shortly for the stragglers),
and I've added a number of other unit tests around the new functionality.

Maxime, to respond to your thoughts:

- I misunderstood the utility of run_id; I thought it would be replaced by
the more concrete (dag_id, execution_date) key. I understand the use case
and it will go back in.

- DagRun.lock_id is the id number of the job that is currently executing
the DagRun. This could have been a simple True/False flag, but that could
lead to orphaned DagRuns (job locks it, then dies, and it stays locked
forever). So I have a zombie check to make sure that every locked DagRun
corresponds to a living job. If no (living) job matches the lock_id, the
zombie check unlocks the DagRun so it can be executed by a different job.
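
A rough sketch of that zombie check in SQLAlchemy-style code (Job, DagRun,
State, and the liveness test stand in for the proposed models -- the names
are assumptions, not the actual implementation):

    def unlock_zombie_dagruns(session):
        # ids of jobs that are still heartbeating
        alive_job_ids = [
            job.id
            for job in session.query(Job).filter(Job.state == State.RUNNING)
            if job.is_alive()]  # heartbeat seen within the allowed window

        zombies = (
            session.query(DagRun)
            .filter(DagRun.lock_id.isnot(None))
            .filter(~DagRun.lock_id.in_(alive_job_ids)))

        for run in zombies:
            run.lock_id = None  # unlock so another job can pick it up
        session.commit()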

- Could you please elaborate on the point about the backfill constraint? If
I understand you, we want to make sure that when backfilling a bunch of
DagRuns, they are only executed by the Backfill and don't get picked up by
the Scheduler. That is easily achieved by locking the DagRuns to the
Backfill's id, but I'm not totally clear on the motivation.

- DagRunJob is as async as the executor it's working with. Think of it as a
generalization of the current Scheduler loop -- all it really does is hand
the tasks off to an Executor in the correct order, and call
executor.heartbeat(). If the executor blocks (like SequentialExecutor),
then DRJ will pause until it's done. If the executor runs async
(Local/Celery), then DRJ will continue. The Backfill DRJ is the same, with
the addition of a progress() function to print % complete and other
statistics, similar to the current Backfill.
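
A stripped-down sketch of that loop (method names are illustrative, not the
final API):

    class DagRunJob(BaseJob):
        def _execute(self):
            while self.dag_runs:
                for run in list(self.dag_runs):
                    # hand ready tasks to the executor in dependency order
                    for ti in run.ready_task_instances():
                        self.executor.queue_task_instance(ti)
                    if run.state in (State.SUCCESS, State.FAILED):
                        self.dag_runs.remove(run)
                # SequentialExecutor blocks here until queued work finishes;
                # Local/Celery executors return immediately, so the job is
                # exactly as async as its executor
                self.executor.heartbeat()
                self.progress()

        def progress(self):
            # no-op by default; the Backfill DRJ overrides this to print
            # % complete and other statistics
            pass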

- I will have a look at the subprocess work; I'm not familiar with it yet.

- +1 to splitting it up :) This is probably revision #3 or #4 and the
history was so messy that I squashed it all. I will break it into
manageable bite-sized commits shortly.

Best,
J


Re: Notes about DagRun-Refactor-(Scheduler-2.0)

Posted by Chris Riccomini <cr...@apache.org>.
Thanks, Jeremiah. I commented on the JIRA.


Re: Notes about DagRun-Refactor-(Scheduler-2.0)

Posted by Jeremiah Lowin <jl...@gmail.com>.
Thanks to my schedule, Maxime beat me to it :) but nonetheless I'd like to
call everyone's attention to the proposal here:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=62694286 and
new tracking issue here: https://issues.apache.org/jira/browse/AIRFLOW-14.

To set the stage, there are two major DAG execution paths in Airflow -
Scheduler and Backfill. They do not work the same way; in particular,
Scheduler creates DagRuns and Backfill does not. For users who exclusively
utilize one or the other, things generally work as expected, but problems
can arise when the two are used simultaneously (most egregiously with the
SubDagOperator). In addition to the potential for "cross-Job" collisions,
the two paths create maintenance difficulties because execution
bugs/issues/enhancements must be addressed in two different environments.

The framework I'm proposing here unifies all DAG executions around a more
formalized DagRun concept.

Just as a task/operator is an abstract description of work and a
TaskInstance is an actual execution with an associated date and state, a
DAG is a description of a workflow and a DagRun is an execution of a DAG
with an associated date and state. In other words: if you want to run a
DAG, you create a DagRun and let the DagRun and a DagRunJob manage the
execution. You don't just loop over all the tasks until they're done
(looking at you, BackfillJob).
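
In code, the unified path might look roughly like this (a sketch with
illustrative names; dag, State, DagRunJob, and LocalExecutor are assumed to
be in scope):

    from datetime import datetime

    # running a DAG = create a DagRun, then let a DagRunJob drive it
    run = dag.create_dagrun(
        run_id='manual__2016-04-28',
        execution_date=datetime(2016, 4, 28),
        state=State.RUNNING)

    job = DagRunJob(dag_runs={run}, executor=LocalExecutor())
    job.run()  # executes task instances until the DagRun succeeds or fails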

So as I've hinted, DagRunJob is a new type of Job that makes sure DagRuns
execute appropriately. Each DRJ has a set of DagRuns that it's responsible
for executing until they succeed/fail. However, a core idea is that DagRuns
are not tied exclusively to any one DRJ; we need to support many-to-many
relationships where multiple DRJs are trying to execute multiple
(overlapping) DagRuns. This is critical for making Airflow robust in
distributed environments, not to mention supporting existing workflows like
backfilling a DAG while the scheduler is running. The way that works is a
DRJ takes a lock on any DagRun it's trying to execute; other DRJs skip over
locked DagRuns.
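
The lock itself could be as simple as an atomic compare-and-set on the
lock_id column; a hypothetical SQLAlchemy sketch:

    def try_lock(session, dag_run, job_id):
        # claim the DagRun only if no other job currently holds it
        claimed = (
            session.query(DagRun)
            .filter(DagRun.id == dag_run.id, DagRun.lock_id.is_(None))
            .update({DagRun.lock_id: job_id}, synchronize_session=False))
        session.commit()
        return claimed == 1  # False: another DagRunJob won, so skip this run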

Scheduler and Backfill are simply subclasses of DagRunJob. Scheduler is
special in that its execution loop never terminates; it just keeps
scheduling/looking for DagRuns. Backfill is special in that it automates
creating the DagRuns for its requested backfill dates; otherwise it's a
completely normal DRJ.
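
Illustratively (hypothetical names again), the two specializations might
reduce to something like:

    class SchedulerJob(DagRunJob):
        def _execute(self):
            while True:  # never terminates
                # keep discovering DagRuns whose schedule has come due
                self.dag_runs.update(self.find_due_dagruns())
                self.execute_dag_runs()

    class BackfillJob(DagRunJob):
        def __init__(self, dag, start_date, end_date, **kwargs):
            super(BackfillJob, self).__init__(**kwargs)
            # pre-create one DagRun per schedule interval in the window,
            # then behave like any other DagRunJob
            self.dag_runs = set(
                dag.create_dagrun(execution_date=d, state=State.RUNNING)
                for d in dag.date_range(start_date, end_date=end_date))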

Anyway, this is a very brief overview of my thought process. Please see the
wiki/issue for more details and I will respond to any questions (including
Maxime's in one second!). This is a large change but I believe a critical
one and the more eyes the better!

J


Re: Notes about DagRun-Refactor-(Scheduler-2.0)

Posted by Chris Riccomini <cr...@apache.org>.
Please also open JIRAs for this stuff so people can see what feature work
is going on without tracking the mailing list.


Re: Notes about DagRun-Refactor-(Scheduler-2.0)

Posted by Chris Riccomini <cr...@apache.org>.
Hey Maxime,

Great, thanks.

> We should share our roadmap and sprints systematically; I'll talk to our
> PM about making this part of the process.

Keep in mind that you guys will need to get feedback from the community.
Deciding on how things are implemented has to be done collectively, e.g.
how DAGs are deployed in Airflow. (Is this what the git time machine is? I
have concerns about using Git as a deployment mechanism, as you described
with Data Swarm.)

Cheers,
Chris


Re: Notes about DagRun-Refactor-(Scheduler-2.0)

Posted by Maxime Beauchemin <ma...@gmail.com>.
Dan's got a work-in-progress PR out for refactoring the dependency
engine:
https://github.com/airbnb/airflow/pull/1435

Paul, can you share the work you're doing on the scheduler, or your plans?
The idea there is to parse DAGs only in short-lived subprocesses.

As for the "git time machine" I believe Paul has a wiki page we're getting
ready to share. Dan has worked on git sync at scale for CI workloads at
Twitter, so that brings extra confidence in this approach.

About docker/containment, it's pretty much just conversations so far. We're
struggling with the idea of getting some of our Chef recipe assets, like
service discovery, inside Docker containers. Juggling containers in a Chef
world is pretty foreign to all of us.

Most pieces aren't exactly in motion yet, but we know big things are going
to move soon.

We should share our roadmap and sprints systematically; I'll talk to our PM
about making this part of the process.

Max


Re: Notes about DagRun-Refactor-(Scheduler-2.0)

Posted by Chris Riccomini <cr...@apache.org>.
> There are many other large pieces in motion (distributing the scheduler
> and parsing the DagBag in subprocesses, the git time machine,
> docker/containment, ...).

Maxime, can you please get the work you're doing documented somewhere
public?
