Posted to dev@airflow.apache.org by Tomasz Urbaszek <tu...@apache.org> on 2020/07/07 10:55:38 UTC

Defining Airflow idempotence

Hello everyone,

The wealth of integrations with external services, a.k.a. operators, is
one of the biggest advantages of Airflow. As the documentation states:
"An operator represents a single, ideally idempotent, task."

Idempotence is, I think, the key to creating a usable operator. It
ensures that we can run backfills and use fewer resources. The problem
is that there's no official Airflow definition of idempotence. Or at
least I'm not aware of any.

What do I mean by "Airflow definition"? By this, I mean a guide or
recipe for making an operator idempotent including the limits of
real-world idempotency.

The reason for bringing up this topic is these two PRs:
- https://github.com/apache/airflow/pull/9593, which improves creating a
Dataproc cluster (create; if it exists, check its state; if the state is
wrong, delete it, wait, and then create a new one)
- https://github.com/apache/airflow/pull/9590, which improves BigQuery insert
job idempotency (submit; if the job_id exists, check its state; if
running/ok, reattach; if failed, generate a new job_id and submit)
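
A minimal sketch of the second flow (submit, reattach, or resubmit under a
new job_id) may help make the discussion concrete. It is not the code from
the PR: `FakeJobService`, `submit_or_reattach`, and the state names are
hypothetical, and an in-memory dictionary stands in for the external job API.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FakeJobService:
    """Hypothetical in-memory stand-in for an external job API."""

    jobs: Dict[str, str] = field(default_factory=dict)  # job_id -> state

    def submit(self, job_id: str) -> None:
        if job_id in self.jobs:
            raise KeyError(f"job {job_id} already exists")
        self.jobs[job_id] = "RUNNING"


def submit_or_reattach(service: FakeJobService, job_id: str) -> str:
    """Return the id of the job this task run should wait on."""
    state = service.jobs.get(job_id)
    if state is None:
        # First attempt: nothing to reattach to, so submit.
        service.submit(job_id)
        return job_id
    if state in ("RUNNING", "DONE"):
        # A previous attempt already submitted this job: reattach to it.
        return job_id
    # The previous attempt failed: the old id cannot be reused, so derive a new one.
    new_id = f"{job_id}-{uuid.uuid4().hex[:8]}"
    service.submit(new_id)
    return new_id
```

Calling `submit_or_reattach` twice with the same `job_id` leaves a single job
in the service; only a failed job leads to a second submission.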

Both PRs implement suggestions from our users and solve real,
production-grade problems. Both do this in an imperfect way because
each of those operators tries to tackle a variety of idempotence
problems. This requires some custom logic that has to work with
non-deterministic situations (e.g. Dataproc and the unknown time needed
to delete a cluster). And that makes me wonder what the exact
definition of "single, ideally idempotent, task" is.

Operators should answer users' needs - there's no question about that.
But it is the community that will have to maintain the operators. And
maintaining complex logic which is hard (or nearly impossible) to test
in an e2e way is not a pleasant task.

What I would like to ask you is:
- what does it mean to you that an operator is idempotent?
- what does "single task" mean? Does it mean a single event or
an operation (a set of events)?

By doing this I would like to work on a set of how-to rules for
designing the logic of the `execute` method. I would like to encourage you
to share your experiences with designing and working with complex
operators :)

Hope you are good,
Tomek

Re: Defining Airflow idempotence

Posted by Jarek Potiuk <Ja...@polidea.com>.
I have now updated the document with some "structure" and some
clarifications/context on what we want to achieve by writing the
document. I have not yet thought about or elaborated on particular
scenarios in there, but I think having such "context" and "aim" for the
document, plus a general structure, might help with hashing out the details.

Please take a look and let me know what you think.

J


On Fri, Jul 10, 2020 at 2:55 PM Tomasz Urbaszek <tu...@apache.org> wrote:
> [...]



-- 

Jarek Potiuk
Polidea | Principal Software Engineer

M: +48 660 796 129

Re: Defining Airflow idempotence

Posted by Tomasz Urbaszek <tu...@apache.org>.
Thanks Jacob and Daniel for your insights! I created a draft of "Airflow
operators design guidelines" https://s.apache.org/airflow-operators

I've left some questions that I think should be addressed. Feel free to
answer, add yours, comment, suggest and edit. I think that once we have
a general idea of what we as a community expect from a "good operator" we
can start to think about what is mergeable or not.

Tomek

On Fri, Jul 10, 2020 at 3:28 AM Daniel Standish <dp...@gmail.com>
wrote:

> [...]

Re: Defining Airflow idempotence

Posted by Daniel Standish <dp...@gmail.com>.
We should be careful not to treat every line in the docs as "constitution"
-- i.e. as a commandment.

And in the docs, I think we would be better off if we more clearly
distinguished (1) the description of what *is* from (2) opinion about
what *should be.*

*This line should be chopped*

Case in point is the line that animates this thread: *"An operator
represents a single, ideally idempotent, task."*  (from here
<https://airflow.readthedocs.io/en/stable/howto/operator/index.html>)

It's just a guess, but I suspect that this line was not meant (or voted on)
as a binding rule for the airflow project, but merely meant to serve as a
one-sentence answer to the question "what is an operator" on what is
essentially a table of contents page.

I think we should actually remove the line.

Normative bits should be in a normative context, and this kind of content
makes more sense in an "operator design patterns" page, or a "best
practices" section, where the merits of different patterns and reasoning
can be presented.  And to the extent it is meant as a guideline or a rule
-- it's too vague to be useful.

So I'd propose we chop it and just leave the second line:

> See the Operators Concepts
> <https://airflow.readthedocs.io/en/stable/concepts.html#concepts-operators>
> documentation and the Operators API Reference
> <https://airflow.readthedocs.io/en/stable/_api/index.html> for more
> information.


*Is idempotence ideal?*

Incidentally, even in the context of a "best practices page" I'd argue
against the claim that *"idempotence is ideal."*  First of all it needs
clarification about what it actually means.  But suppose that we accept
that it means the canonical execution date pattern.  Some pipelines and
tasks lend themselves to this pattern; some do not.  And while it is a good
pattern where it works, it's not the only valid design pattern, it isn't
the best solution for every data problem, and therefore it doesn't make
sense to refer to it as "the ideal pattern".

The execution_date-based idempotence pattern has special importance to
airflow but I think that in reality the average cluster will have a variety
of design patterns -- not all of them using the execution_date idempotence
pattern.  And I think we should reflect that reality in our docs and
decision-making.
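
As an illustration of the execution_date-based pattern mentioned above, here
is a small sketch. It assumes a task that reads and writes only the data
belonging to its execution date; `run_for`, `Event`, and `Warehouse` are
hypothetical names, and plain Python containers stand in for real storage.

```python
from datetime import date
from typing import Dict, List, Tuple

Event = Tuple[date, int]           # (day the event happened, value)
Warehouse = Dict[date, List[int]]  # one output partition per execution date


def run_for(execution_date: date, source: List[Event], warehouse: Warehouse) -> None:
    """Rebuild the partition for `execution_date` from scratch.

    Input selection and output location depend only on `execution_date`,
    never on the wall clock or on what earlier runs left behind.
    """
    rows = [value for day, value in source if day == execution_date]
    warehouse[execution_date] = rows  # overwrite, do not append


source = [(date(2020, 7, 6), 1), (date(2020, 7, 7), 2), (date(2020, 7, 7), 3)]
warehouse: Warehouse = {}

run_for(date(2020, 7, 7), source, warehouse)
run_for(date(2020, 7, 7), source, warehouse)  # retry or re-run of the same date
run_for(date(2020, 7, 6), source, warehouse)  # backfill of an earlier date
```

Re-running a date replaces its partition instead of adding to it, so retries
and out-of-order backfills converge on the same result. A task that appends,
or that reads "everything new since the last run", does not fit this pattern,
which is the kind of pipeline the paragraph above has in mind.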

*What is a single task?*

Notwithstanding the above, regarding Tomek's question about the meaning of
"single task", I think in effect what is meant here is just "*discrete* task"
or "unit of work" -- a unit of work that can be picked up and executed on a
worker.  I don't take it as a claim about *recommended* operator scope --
if that's what it is meant to be, it should probably be made explicitly and
in an appropriate context.

*What an operator is, vs what is mergeable*

On another note, I think it also may be helpful to separate the question
"what is an operator" from "what kinds of operators belong in airflow".
Indeed this is another area of ambiguity in the quote above -- is it a
claim about a best practice for users, as they implement operators for
their organization?  Or is it a claim about guidelines when considering
whether to merge a new operator into airflow?

From the perspective "what is an operator", it is clear to me that an
operator is (1) not necessarily idempotent, and (2) has arbitrary scope
(i.e. re Tomek's 'what is single task' question).

   - idempotence is in general undefined because it depends entirely on how
   the user defines the task. (e.g. look at any SqlOperator)
   - scope is clearly arbitrary because `execute` can be implemented
   arbitrarily.
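
The first point can be sketched with the standard library's `sqlite3` module.
Here `run_sql` is a hypothetical stand-in for a generic SQL operator (it is
not Airflow code), and the `sales` table is invented for the example. The
"operator" is identical in both cases; only the SQL supplied by the user
decides whether a second run changes the outcome.

```python
import sqlite3
from typing import Dict, List


def run_sql(conn: sqlite3.Connection, statements: List[str], params: Dict[str, object]) -> None:
    """Hypothetical stand-in for a generic SQL operator: run the user's SQL as given."""
    with conn:  # commit on success, roll back on error
        for statement in statements:
            conn.execute(statement, params)


def fresh_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (day TEXT, amount INTEGER)")
    return conn


params = {"day": "2020-07-07", "amount": 10}

# User-defined task 1: every run adds another row.
append_only = ["INSERT INTO sales VALUES (:day, :amount)"]

# User-defined task 2: every run first clears the rows it is responsible for.
delete_then_insert = [
    "DELETE FROM sales WHERE day = :day",
    "INSERT INTO sales VALUES (:day, :amount)",
]


def count_after_two_runs(statements: List[str]) -> int:
    """Run the same task twice against an empty table and count the rows."""
    conn = fresh_db()
    run_sql(conn, statements, params)
    run_sql(conn, statements, params)
    return conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

With `append_only` the second run duplicates the row; with
`delete_then_insert` it does not, even though `run_sql` itself never changed.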

Concerning "what kinds of operators belong in airflow"...  I think it's
clear that idempotence is not a requirement (because it's not in general a
thing that is determinable based on operator design alone, but depends on
usage).  But are there principles or guidelines that we should try to
adhere to, or evaluate against?  There very well might be.  Or should we
try to maintain *compatibility* with a certain notion of idempotence, even
if we don't have well-defined idempotence criteria?  Maybe so.

Re: Defining Airflow idempotence

Posted by Jacob Ferriero <jf...@google.com.INVALID>.
Very curious to follow this discussion!

I think there's been debate, even within Airflow itself, about what we
should support for XCom regarding idempotency.
A while back there were some discussions on this, and a lack of
consensus (reverted: #6370 <https://github.com/apache/airflow/pull/6370>)
killed the idea in PR #6210 <https://github.com/apache/airflow/pull/6210>.
Some interesting threads to review about idempotency and XCom are here
<https://github.com/apache/airflow/pull/6210#discussion_r335593800> and here
<https://github.com/apache/airflow/pull/6370#issuecomment-546579924>.

I'm by no means an expert on this but I personally might suggest the
working definition:
"By the end of its lifetime, an Airflow task should be authoritative on
the target state of whatever that task modifies, regardless of previous
task runs"

Or stated less obtusely:
"The state of the universe that an Airflow task can modify should be
deterministic by the end of a task run, irrespective of state changes due
to previous task runs"

This captures the spirit of idempotency by focusing on how an Airflow task
affects the state of the universe rather than on the implementation detail
of whether one task run affects the execution path of another task run.

This definition allows for the "create resource X if it does not exist;
otherwise (re)attach to the state of existing resource X" logic that you
describe for Dataproc cluster creation / BQ job creation.
The need for this sort of behavior in Airflow extends to being able to
re-attach to any long-running task (Dataflow job, Spark job, Hive query
job, etc.).
This is sort of possible with a SubmitJobOperator and PollJobSensor, but
it critically misses the ability to retry the job submission when a poke
indicates a (retriable) failed state of the job (e.g. the job fails because
inputs from some upstream dependency not managed by Airflow, such as a file
drop from a 3rd-party vendor, don't exist yet).

A side note: I personally think that in general DELETING a resource that
does not match the desired state of the operator is potentially dangerous
and should always be a configurable behavior (kudos for doing this in
the Dataproc PR!).
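
To make the "create if not exists, otherwise reattach, and delete only when
explicitly allowed" idea concrete, here is a rough sketch. It is not the
Dataproc operator: `FakeClusterService`, `ensure_cluster`, the `allow_delete`
flag, and the state names are all hypothetical, and deletion is treated as
instantaneous, which the real service does not guarantee.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FakeClusterService:
    """Hypothetical in-memory stand-in for a cluster API."""

    clusters: Dict[str, str] = field(default_factory=dict)  # name -> state


def ensure_cluster(service: FakeClusterService, name: str, allow_delete: bool = False) -> str:
    """Make sure a healthy cluster called `name` exists; report what was done."""
    state = service.clusters.get(name)
    if state is None:
        service.clusters[name] = "RUNNING"
        return "created"
    if state == "RUNNING":
        # A previous run (or someone else) already created it: reuse it.
        return "reused"
    if not allow_delete:
        # Destroying a resource we did not verify is opt-in only.
        raise RuntimeError(f"cluster {name!r} is in state {state!r}; refusing to delete it")
    # A real implementation would have to wait here for the deletion to finish.
    del service.clusters[name]
    service.clusters[name] = "RUNNING"
    return "recreated"
```

Whatever the starting state, a successful call ends with a running cluster
of that name, which matches the working definition above; the destructive
path is taken only when the caller passes `allow_delete=True`.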

On Thu, Jul 9, 2020 at 7:25 AM Jarek Potiuk <Ja...@polidea.com>
wrote:

> [...]


-- 

*Jacob Ferriero*

Strategic Cloud Engineer: Data Engineering

jferriero@google.com

617-714-2509

Re: Defining Airflow idempotence

Posted by Jarek Potiuk <Ja...@polidea.com>.
All for it. I think misunderstandings and assumptions about what "idempotency"
really means in the context of Airflow tasks have bitten us more than once.
I'd love to help with working out the right definition (and it's not
straightforward). I will have to give it quite a bit of thought to get some
of the corner cases and "guidelines" on them hashed out.

On Tue, Jul 7, 2020 at 12:55 PM Tomasz Urbaszek <tu...@apache.org>
wrote:

> [...]


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129