Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/09/09 20:07:26 UTC

[GitHub] [airflow] ashb edited a comment on pull request #17576: Add pre/post execution hooks

ashb edited a comment on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916396684


   Forgive me, it's late and been a long day, so I may not be as lucid as I'd hope.
   
   My main concern here is about being able to reason about what a DAG will do. By adding the ability to attach arbitrary pre_execute code to any operator in a DAG, we end up in a world where it is very hard to look at a DAG and understand what it's going to do. 
   
   > So in this case, we can't start processing the data before we know it's come in (let's assume that this is entirely based on time of day, you can't "sense" it).
   
   I dispute the 'you can't "sense" it'. Strongly.  And processing based on timing alone is the worst possible idea -- removing arbitrary time delays between tasks was one of the main reasons Airflow has dependencies between tasks.
   
   The evolution of data processing workflows often goes:
   
    - Oh, I've only got one thing to run, I can put it on cron
    - Oh, and a second one, but it's unrelated to the first, I can cron that too
    - Now I want to combine those two outputs, it's okay I'll just delay it by an hour.
    
   That approach will work for months -- right up until you hit an inflection point (more users, more processing) and suddenly your entire pipeline is in an inconsistent state: maybe you combined data from two different days, and maybe you don't notice for a month. This is not hyperbole, but lived experience.
   
   As for "you can't sense it": either it's a file on disk/S3/blob store, or a table in a DB, but if you are about to have an operator process it (i.e. read or copy it), then you can, by definition, sense whether it's there or not. 
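   "Sensing" here is nothing more exotic than a poke loop, which is what Airflow's sensors (FileSensor, S3KeySensor, etc.) do under the hood. A minimal plain-Python sketch of that idea -- `wait_for_file` and the timings are illustrative, not Airflow's actual API:

```python
import os
import time


def wait_for_file(path: str, poke_interval: float = 1.0, timeout: float = 60.0) -> bool:
    """Return True once `path` exists, or False if `timeout` elapses first.

    This is the core of what a sensor does: check the concrete condition
    the downstream task depends on, instead of assuming the data landed
    after some fixed delay.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poke_interval)
    return False
```

   The same shape works for an S3 key (`head_object`) or a DB table (`SELECT 1 FROM ... LIMIT 1`) -- anything the operator is about to read can be probed first.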
   
   So I view "delay a task X minutes to allow data to be processed" as a **huge** anti-pattern. Airflow works best when it is deterministic; doing this makes it not.
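   To make the determinism point concrete, a toy sketch (the function names and numbers are mine, not from the PR): a fixed delay is only "correct" while the upstream runtime happens to stay inside the delay window, whereas checking the actual condition is correct regardless of runtime:

```python
def data_ready_after_delay(upstream_runtime_min: float, delay_min: float) -> bool:
    # The "just delay it by an hour" approach: we *assume* the data landed
    # once the delay has passed. Wrong as soon as upstream slows down.
    return upstream_runtime_min <= delay_min


def data_ready_by_sensing(upstream_done: bool) -> bool:
    # The dependency/sensor approach: we check the actual condition,
    # so downstream never runs against missing or half-written data.
    return upstream_done


# Upstream used to take 40 minutes, so a 60-minute delay looked safe...
assert data_ready_after_delay(upstream_runtime_min=40, delay_min=60)
# ...until growth pushed it to 90 minutes: the downstream task now runs
# against incomplete data, silently.
assert not data_ready_after_delay(upstream_runtime_min=90, delay_min=60)
```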
   
   To the "skip expensive operation if dev": I've not seen anyone ask for that -- read/write to a different bucket in different envs plenty of times, but never skip an operation entirely based on env (cos if you've skipped one step, you have to skip the entire "branch" too). I had a quick search on https://apache-airflow.slack-archives.org and couldn't find it -- you might have a better idea what to search for (it uses Postgres full text search so the stemming might be a bit simplistic)
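   On "skip one step, skip the entire branch": that propagation is what you get for free from Airflow's skip semantics (e.g. ShortCircuitOperator), and it's easy to see why it has to be transitive. A plain-Python sketch of the propagation -- the task names and the `downstream` map are made up for illustration:

```python
def tasks_to_skip(skipped: set[str], downstream: dict[str, list[str]]) -> set[str]:
    """Transitively collect everything downstream of the initially skipped
    tasks, mirroring how a skip propagates through a DAG branch."""
    result = set(skipped)
    frontier = list(skipped)
    while frontier:
        task = frontier.pop()
        for child in downstream.get(task, []):
            if child not in result:
                result.add(child)
                frontier.append(child)
    return result


# Skipping the expensive step in dev drags its whole branch along with it:
downstream = {
    "expensive_transform": ["load_warehouse"],
    "load_warehouse": ["refresh_dashboard"],
}
skipped = tasks_to_skip({"expensive_transform"}, downstream)
```

   Which is exactly why an env-conditional skip belongs in the DAG structure (a branch), not hidden inside a per-operator pre_execute hook.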
   
   No, there's nothing I have planned in the AIPs I hinted at in my keynote (most of them are just ideas anyway at this stage)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org