You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/08/12 12:44:54 UTC

[GitHub] [airflow] malthe opened a new pull request #17576: Add pre/post execution hooks

malthe opened a new pull request #17576:
URL: https://github.com/apache/airflow/pull/17576


   This adds optional operator parameters `pre_execute` and `post_execute`, both of which run synchronously immediately before and after task execution.
   
   For example, this can be used to skip execution for a particular day of the week:
   ```python
   from croniter import match
   
   def execute_unless(cron_expr):
       def hook(context):
           if match(cron_expr, context["execution_date"]):
               raise AirflowSkipException()
       return hook
   
   DummyOperator(..., pre_execute=execute_unless("* * * * sat"))
   ```
   
   Related: #17545


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916379253


   One of the good cases is to distinguish between DEV/PROD environments. 
   
   For example some of the operators  could be skipped on DEV environments and you might want to add single method throwing "Ski[p" exception in the DEV environment (for costly operations) 
   
   This is mostly a feature that promotes good engineering practices such as being able to run the same code on different environments, wtihout changing the DAGs or adding artifficial logic in the DAG structure, to handle different execution environments. 
   
   I think this is the main case (often raised by our users in Slack). If we do not have hooks, the users would have to develop custom operators of different kind to handle different behaviours. Having the configurable hooks you can deploy the same DAGs for different environments without adding complex logic or having to add custom operators.
   
   Many of our users (and DAG writers) are Data scientists. You could potentially have the same effect by extending N operators and adding pre/post hooks there but for those users classes and objects are "alien" - they think functional and adding pre/post hooks is so much easier, especially that you can much more easliy have one common function that you pass to a number of different, completley unrelated operators. This is a classic "cross-cutting-concern" implementation IMHO.
   
   This is at least what has been my motivation when guiding and approving this one.
   
   Any other idea how to approach it? I am not against other solutions, and I would like to hear what anti-patterns it promotes (in the context I described). Could you tell @ash what antipattern is in play here, as I am not sure  I see it ?
   
   Of course if it somehow makes other parts (non-completed AIP's especially) much more difficult to implement. we can change the way it is implemented, but I think the cabability of making "cross-cutting-concern" - for example tied to an environment where the DAG is executed is a powerful feature that might make the deployment and testing DAGs at scale much easier. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] JavierLopezT edited a comment on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
JavierLopezT edited a comment on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-920716223


   I think this is a great feature
   
   We use custom SnowflakeOperator and S3ToSnowflakeOperator, and it would be great if we can't avoid use them and just have the official operators with pre and post execute callbacks. For instance, our S3ToSnowflake has an argument called _previous_queries_, that will execute a query before the copy for creating a table. Then, we have functions for moving and/or delete s3 files and finally, making a constraint keys check. All those things happen either before or after the code of the current operator, so here is another use case for this functionality. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-898261046


   Looking at the test, there must be a reason the test it here (not only to annoy the user) and it says explicitly that new fields should not be added, so that makes me wonder if just adding the fields to the test is a good idea..
   
   @kaxil - is that OK that we add new fields to Base Operator? Will that work for already serialized Tasks? to de-serialize them without problems ? Or do we need to do something else besides adding the two two fields to the test?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-898029293


   LGTM - but serialisation needs to be updated to skip those new fields (tests are failing).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] malthe commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
malthe commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-903329383


   @potiuk rebased – but what comment from @kaxil do you mean? I don't see any ...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-901396372


   Hey @kaxil -> do you think we can just add _pre/_post to the test?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-904010653


   > These two new fields you add don't need to be serialized (and callables can't generally be anyway) -- the general rule is that things needed by the Scheduler should be serialized, and that test was there to make people think about the change.
   > 
   > In this case, since the scheduler doesn't care about these fields they should be added to the ignore list in the test.
   
   Ah cool.  I will close/reopen to rebuild, but I think this one is good-to-go.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916375678


   @malthe Can you give a real example (or as real as you can) where you want a DAG to start at (say) 7am but one task _in that same dag_ to not start until 7:30? My gut says those are not the same workflow if that's the case. (Am I understanding the use case you mentioned https://github.com/apache/airflow/pull/18112#discussion_r705622219 right?)
   
   And I should clarify my previous statement -- I want to disucss reverting this or keeping it, I'm not going to just revert it blindly. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] JavierLopezT edited a comment on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
JavierLopezT edited a comment on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-920716223


   I think this is a great feature
   
   I use custom SnowflakeOperator and S3ToSnowflakeOperator, and it would be great if we can't avoid use them and just have the official operators with pre and post execute callbacks. For instance, our S3ToSnowflake has an argument called _previous_queries_, that will execute a query before the copy for creating a table. Then, we have functions for moving and/or delete s3 files and finally, making a constraint keys check. All those things happen either before or after the code of the current operator, so here is another use case for this functionality. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916426991


    I think I even discussed it with @malthe that time based decision will be better done based on AIP-39. So I agree @ashb - this is likely not a good use of those pre/post_hooks. see https://github.com/apache/airflow/pull/17545 discussion . I
   
   agree timetable will be WAY better way of handling the "skip_unless"<CRON>" "  ( and honestly I have not paid attention at the example in PR "description"  - more at the implementation). And explicit better and implicit surely - I would never encourage a "logical change" in processing the DAG based on those. I think we are aligned here.
   
   But the cross-cutting concerns which are not time-related, but environment-related, I still think there is a good use of it. I actually spent last ~ month or so, mostly tuned to listen, analyse respond and help users in slack/stackoverflow/github discussion and this has been explicitly asked at least once (I think this is where the idea was born that it might make sense0: https://apache-airflow.slack.com/archives/CCR6P6JRL/p1629705521304600 but I also personally think that better support for more "enterprise'y" setup where you have "mostly the same but subtly different" environment and "same dags but behaving slightly differently" is something that is really useful.
   
   And just one more comment - this is just a bit easier way of doing something that is entirely possible today, so while I understand we might not want to "promote" it, just having pre/post hooks is encouraging similar patterns. Maybe we should rather get rid of those pre/post hooks if we want to go 'clean" as it promotes even "worse" patterns. For me the pre/post executed looked entirely unnatural and useless to be honest - up until we added the possibility of overriding those (which is the cross-cutting-concern I am mentioning), If anything, I think it makes it much cleaner and easier to manage.
   
   (Pseudo code of course)
   
   Before:
   
   ```
   def pre_hook():
        raise AirflowSkipException() if DEV
   
   def MyCustomOperatorA(OperatorA):
        def pre_execute():
              pre_hook()
   
   def MyCustomOperatorB(OperatorB):
        def pre_execute():
              pre_hook()
   
   with DAG() as dag:.
       a = MyCustomOperatorA()
       b =  MyCustomOperatorB() 
   
       a >> b
   ```
   
   After:
   
   ```
   def pre_hook():
        raise AirflowSkipException() if DEV
   
   with DAG() as dag:.
       a = OperatorA(pre_execute = pre_hook)
       b =  OperatorB(pre_execute = pre_hook)
       a >> b 
   ```
   
   But I might of course be biased, I usually think more from "development" point of view - so not "how easy  and clean it is to run it in produtiton" but "how easy it is to develop and test it in the dev/test/staging env before it gets into PROD". I just see DAG development as a process rather than end-goal. This might be very biased and maybe it can be achieved in a simpler way :). 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-898261046


   Looking at the test, there must be a reason the test it here (not only to annoy the user) and it says explicitly that new fields should not be added, so that makes me wonder if just adding the fields to the test is a good idea/enough.
   
   @kaxil - is that OK that we add new fields to Base Operator? Will that work for already serialized Tasks? to de-serialize them without problems ? Or do we need to do something else besides adding the two two fields to the test?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] malthe commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
malthe commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-898216061


   @potiuk in `BaseOperator.get_serialized_fields` I see for example "_upstream_task_ids" but not "_downstream_task_ids".
   
   What's the logic? And do I need to add these new `_pre_execute_hook` and `_post_execute_hook` attributes here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916793392


   We can even print warning when it is used, that it is experimental.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916426991


    I think I even discussed it with @malthe that time based decision will be better done based on AIP-39. So I agree @ashb - this is likely not a good use of those pre/post_hooks. see https://github.com/apache/airflow/pull/17545 discussion . 
   
   I agree timetable will be WAY better way of handling the "skip_unless"<CRON>" "  ( and honestly I have not paid attention at the example in PR "description"  - more at the implementation). And explicit better and implicit surely - I would never encourage a "logical change" in processing the DAG based on those. I think we are aligned here.
   
   But the cross-cutting concerns which are not time-related, but environment-related, I still think there is a good use of it. I actually spent last ~ month or so, mostly tuned to listen, analyse respond and help users in slack/stackoverflow/github discussion and this has been explicitly asked at least once (I think this is where the idea was born that it might make sense0: https://apache-airflow.slack.com/archives/CCR6P6JRL/p1629705521304600 but I also personally think that better support for more "enterprise'y" setup where you have "mostly the same but subtly different" environment and "same dags but behaving slightly differently" is something that is really useful. 
   
   And especially when you add the fact that a lot of people compose their DAGs from existing operators without creating custom ones, adding an option to add "common" environmental behaviour to a number of different operators sounds appealing.
   
   And just one more comment - this is just a bit easier way of doing something that is entirely possible today, so while I understand we might not want to "promote" it, just having pre/post hooks is encouraging similar patterns. Maybe we should rather get rid of those pre/post methods if we want to go 'clean" as it promotes even "worse" patterns. For me the pre/post execute looked entirely unnatural and useless to be honest - up until we added the possibility of overriding those (which is the cross-cutting-concern I am mentioning), If anything, I think it makes it much cleaner and easier to manage.
   
   (Pseudo code of course)
   
   Before:
   
   ```
   def pre_hook():
        raise AirflowSkipException() if DEV
   
   def MyCustomOperatorA(OperatorA):
        def pre_execute():
              pre_hook()
   
   def MyCustomOperatorB(OperatorB):
        def pre_execute():
              pre_hook()
   
   with DAG() as dag:.
       a = MyCustomOperatorA()
       b =  MyCustomOperatorB() 
   
       a >> b
   ```
   
   After:
   
   ```
   def pre_hook():
        raise AirflowSkipException() if DEV
   
   with DAG() as dag:.
       a = OperatorA(pre_execute = pre_hook)
       b =  OperatorB(pre_execute = pre_hook)
       a >> b 
   ```
   
   But I might of course be biased, I usually think more from "development" point of view - so not "how easy  and clean it is to run it in produtiton" but "how easy it is to develop and test it in the dev/test/staging env before it gets into PROD". I just see DAG development as a process rather than end-goal. This might be very biased and maybe it can be achieved in a simpler way :). 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eladkal commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
eladkal commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-897731012


   The discussion in the issue is pretty extensive. I want to ask/offer another point of view on this functionality.
   We have `VerticaToMySqlOperator` - This operator has pre & post actions to be configured by the user:
   
   https://github.com/apache/airflow/blob/4556828dccba2df0ab26b0aeeb26d0a30944da6f/airflow/providers/mysql/transfers/vertica_to_mysql.py#L85-L86
   
   Is the goal of this PR to replace & generalize this use case or the real issue you are trying to solve is something else ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-898028425


   > Is the goal of this PR to replace & generalize this use case or the real issue you are trying to solve is something else ?
   
   Something completely different. Pre/postoperators in Vertica are SQL statements called before/after (Those are strings rather than callables)
   
   What this PR tries to achieve is to provide an option to override pre_/post_execute methods from the base operator with ones provided as constructor parameters. While it was already possible to  override them by extending existing operators, it makes it a bit easier (and with less mental barriers for DAG writers) to provide such methods when you create task in DAG. While previously it required to create a new class and use the new class as operator, this on allows to create methods and pass them as parameters of the cosnstructor - which is easier for many users who write the DAGs.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-986363233


   Another possibly interesting use case for pre_execute: 
   
   https://apache-airflow.slack.com/archives/CCVDRN0F9/p1638740065139100
   
   
   Basically adding a "session" attribute that would authorize the call with PythonOperators - defined at a DAG level. Pretty useful IMHO: 
   
   ```
   def authorize(self):
         CREDENTIALS, _= google.auth.default(scopes=['https://www.googleapis.com/auth/cloud-platform'])
         self.session = AuthorizedSession(CREDENTIALS)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916395891


   Just to add one more thing @ashb - if there are other features in upcoming AIP's that might make it useless/obsolete/outdated/replaceable - I am all for it. We just simply might not be aware of some of it as we are not following it super closely (I tried with as much time I could spare) - but if there are other ways to achieve those  things mentined above or thing that they can break that you are aware of  - it would be great  to align here. 
   
   We might simply not know everything you and others working on new AIPs know :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] malthe commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
malthe commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-904500996


   Great — yes, I think it’s a good compromise until we better understand the
   problems around skipping tasks etc. based on scheduling policies.
   
   tir. 24. aug. 2021 kl. 10.58 skrev Jarek Potiuk ***@***.***>:
   
   > Hey @malthe <https://github.com/malthe> - I think this one, will be very
   > powerful. After some discussions on Slack recently, the potential of those
   > overrideable hooks is very interesting for a number of cases :). Glad we
   > got it in for 2.2
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/airflow/pull/17576#issuecomment-904457121>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AAAGOJNSP6D64MVPVRNCLQTT6NNJTANCNFSM5CBDJKZA>
   > .
   > Triage notifications on the go with GitHub Mobile for iOS
   > <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
   > or Android
   > <https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email>
   > .
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-903326972


   Hey @kaxil - any comments here? And @malthe - woudl you please rebase after the comment from @kaxil?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eladkal commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
eladkal commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916429784


   > I think that should be explicit in the dag
   
   +1
   
   > I hadn't considered something like "give my stream processing up to X minutes to process the data for this interval and make it available".
   
   But I think that should be explicit in the dag. If this task should wait until 30mins after the interval is over, then to make it easy to reason about that should be a (Async)TimeDeltaSensor.
   
   Yeah that works. but also isn't it a natural extension for trigger rules? for now trigger rules verify only against upstream tasks status. Would it be reasonable to extend it also to support rules of other areas like timedelta from upstream tasks?
   Something like: ` DummyOperator(..., trigger_rule={"state": "all_done", "delay": 30})`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] malthe edited a comment on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
malthe edited a comment on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916381389


   @ashb a real example could be two data sources coming in at different times, each one needs some post-processing and it makes sense to do this immediately – because ultimately, both are used to form some output that should be available as soon as possible.
   
   So in this case, we can't start processing the data before we know it's come in (let's assume that this is entirely based on time of day, you can't "sense" it). I think it's a fairly normal situation – and at least, in the workflows I'm seeing, the pattern is used all of the place. Is it "one workflow" (i.e. DAG) – I would say definitely because the dependencies are real.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916396684


   Forgive me, it's late and been a long day, so I may not be as lucid as I'd hope.
   
   My main concern here is about being able to reason about what a DAG will do. By adding the ability to add arbitrary pre_execute code before any operator in a DAG we end up in a world where it is very hard to look at a DAG and understand what it's going to do. 
   
   > So in this case, we can't start processing the data before we know it's come in (let's assume that this is entirely based on time of day, you can't "sense" it).
   
   I dispute the 'you can't "sense" it'. Strongly.  And the processing based on timing along is the worst possible idea -- removing arbitray time delays between tasks was one of the main reasons that Airflow has dependencies between tasks.
   
   The evolution of data processing worfklow often goes:
   
    - Oh, I've only got one thing to run, I can put it on cron
    - Oh and a second one, but its unrelated to the first, I can cron that to
    - Now I want to combine those two outputs, it's okay I'll just delay it by an hour.
    
   That approach will work for months. Right up until you hit an inflection point (more users, more processing) and then suddenly your entire pipeline is in an inconsistent state (maybe you combined data from two different days. Maybe you might not notice it for a month. This is not hyperbole, but lived experience.)
   
   As for "you can't sense it": Either it's a file on disk/s3/blob store, or a table in a DB, but if you are about to have an operator process it (i.e. read it or copy it), then you can, by definition, sense if it's there or not. 
   
   To the "skip expensive operation if dev": I've not seen anyone ask for that -- read/write  to different bucket in different envs plenty of time, but never skip an operation entirely based on env (cos if you've skipped one step, you have to skip the entire "branch" too.) I had a quick search on https://apache-airflow.slack-archives.org and couldn't find it -- you might have a better idea what to search for (it uses postgress full text search so the stemming might be a bit simplistic)
   
   No, there's nothing I have planned in the AIPs I hinted at in my keynote (most of them are just ideas anyway at this stage)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] malthe commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
malthe commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916663394


   @eladkal or in my case, instead of a "delay", it would be `trigger_rule={..., "start_time": time(7, 30)}`.
   
   @ashb and I agree about it ideally being explicit. @potiuk's concern was that adding support for skipping executions (i.e. skip on Sunday) and ensuring that a task starts to execute only within some window of time (start time and/or end time) – might stand in the way of future development.
   
   AIP-39 works on the DAG-level. What I'm asking for here is functionality to:
   
   - Conditionally skip task execution based on `execution_date`
   - Trigger the execution of a task only within some window of time
   
   Those are task-level scheduling preferences if you will. I think both of those should ideally be implemented at the scheduling-level and be supported all the way out in the Airflow UI. In this sense, `pre_execute` and the follow-up pull request around deferred tasks are a workaround following @potiuk's reasoning here:
   
   > I think - since we have a simple, generic exception based approach that can be implemented quickly without breaking any of the existing properties of scheduling and execution - we should go for it now.
   
   I don't think it ends up promoting anti-patterns but like pretty much anything, it can be abused to that effect. But it was always a means to an end – the scheduling functionality I wanted is in the bullets above.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] eladkal edited a comment on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
eladkal edited a comment on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916429784


   > I think that should be explicit in the dag
   
   +1
   
   > I hadn't considered something like "give my stream processing up to X minutes to process the data for this interval and make it available".
   >
   >But I think that should be explicit in the dag. If this task should wait until 30mins after the interval is over, then to make it easy to reason about that should be a (Async)TimeDeltaSensor.
   
   Yeah that works. but also isn't it a natural extension for trigger rules? for now trigger rules verify only against upstream tasks status. Would it be reasonable to extend it also to support rules of other areas like timedelta from upstream tasks?
   Something like: ` DummyOperator(..., trigger_rule={"state": "all_done", "delay": 30})`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk merged pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk merged pull request #17576:
URL: https://github.com/apache/airflow/pull/17576


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] malthe commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
malthe commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916406721


   @ashb I do think there are often bad reasons behind wanting to start a task no earlier than some timestamp, but I will maintain that there can be good reasons too and it's not always a single file but instead saying "after this timestamp, I will accept no more data" (and start processing what I have).
   
   So it's not at all about "allowing data to be processed" inside Airflow (i.e. giving enough time, rather than utilizing dependencies), but knowing that some data is available from an external source after a certain point in time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916792442


   > I'd like to get 2.2 release train started, and in order to not delay that unnecessarily how about we mark this feature as experimental -- that way if we decide we _don't_ want it for any reason we can remove it before 3.0 (as otherwise it's removeal would be a breaking change under SemVer)
   
   Absolutely no problem for me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] malthe commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
malthe commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916790800


   @ashb yes that sounds very reasonable – to me, `pre_execute` is really just a workaround. In a perfect world, the scheduling system is able to support these use-cases out of the box, obviating the need for it altogether. I'd love to prioritize thinking about these problems seeing if a better solution can be devised before 3.0.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-920781011


   I think that might be indeed why @ashb insisted on making that feature experimental - because I see that each one of us has different "use case" in mind when looking at the pre/post hooks.
   
   Now when I think of it, it's really closing the gap between more `functional` and more `object oriented` approach of writing DAGs and I think we need to probably figure out a better way of doing that without this hybrid approach where we have Operator classes and functional extensions of them, that does sound like a bit of a "monster hybrid".
   
   The thing is - there is nothing "more" that feature adds - you can do everything you do with the pre-/post- hooks currently (also what you do @JavierLopezT ) but in a slightly less verbose way (with less classes and more functions).
   
   I do not yet how, but I have a feeling that maybe indeed we should try to find a bit nicer way to incorporate this in the current direction we are taking with more "decorator" / "task" oriented way of writing dags and make sure that the current operator classes ( we have sooo many useful ones) can be much easier integrated with the "functional" way of writing dags and incorporating that pre/post behaviour there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-903329602


   I was asking @kaxil  for comment about the serialisation and reason why the tests explicitly mention come fields :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916361093


   Can anyone think of a good use case for this that _isn't_ the `execute_unless` example? Cos that should be covered by the new timetable interface coming in AIP-39, or through an existing sensor.
   
   I'm seriously _considering_ reverting this feature as right now I'm seeing it promoting anti-patterns.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916426991


    I think I even discussed it with @malthe that time based decision will be better done based on AIP-39. So I agree @ashb - this is likely not a good use of those pre/post_hooks. see https://github.com/apache/airflow/pull/17545 discussion . 
   
   I agree timetable will be WAY better way of handling the "skip_unless"<CRON>" "  ( and honestly I have not paid attention at the example in PR "description"  - more at the implementation). And explicit better and implicit surely - I would never encourage a "logical change" in processing the DAG based on those. I think we are aligned here.
   
   But the cross-cutting concerns which are not time-related, but environment-related, I still think there is a good use of it. I actually spent last ~ month or so, mostly tuned to listen, analyse respond and help users in slack/stackoverflow/github discussion and this has been explicitly asked at least once (I think this is where the idea was born that it might make sense0: https://apache-airflow.slack.com/archives/CCR6P6JRL/p1629705521304600 but I also personally think that better support for more "enterprise'y" setup where you have "mostly the same but subtly different" environment and "same dags but behaving slightly differently" is something that is really useful. 
   
   And especially when you add the fact that a lot of people compose their DAGs from existing operators without creating custom ones, adding an option to add "common" environmental behaviour to a number of different operators sounds appealing.
   
   And just one more comment - this is just a bit easier way of doing something that is entirely possible today, so while I understand we might not want to "promote" it, just having pre/post hooks is encouraging similar patterns. Maybe we should rather get rid of those pre/post hooks if we want to go 'clean" as it promotes even "worse" patterns. For me the pre/post executed looked entirely unnatural and useless to be honest - up until we added the possibility of overriding those (which is the cross-cutting-concern I am mentioning), If anything, I think it makes it much cleaner and easier to manage.
   
   (Pseudo code of course)
   
   Before:
   
   ```
   def pre_hook():
        raise AirflowSkipException() if DEV
   
   def MyCustomOperatorA(OperatorA):
        def pre_execute():
              pre_hook()
   
   def MyCustomOperatorB(OperatorB):
        def pre_execute():
              pre_hook()
   
   with DAG() as dag:.
       a = MyCustomOperatorA()
       b =  MyCustomOperatorB() 
   
       a >> b
   ```
   
   After:
   
   ```
   def pre_hook():
        raise AirflowSkipException() if DEV
   
   with DAG() as dag:.
       a = OperatorA(pre_execute = pre_hook)
       b =  OperatorB(pre_execute = pre_hook)
       a >> b 
   ```
   
   But I might of course be biased, I usually think more from "development" point of view - so not "how easy  and clean it is to run it in produtiton" but "how easy it is to develop and test it in the dev/test/staging env before it gets into PROD". I just see DAG development as a process rather than end-goal. This might be very biased and maybe it can be achieved in a simpler way :). 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916413420


   @malthe Valid point -- I hadn't considered something like "give my stream processing up to X minutes to process the data for this interval and make it available".
   
   But I think that should be _explicit_ in the dag. If this task should  wait until 30mins after the interval is over, then to make it easy to reason about that should be a (Async)TimeDeltaSensor.
   
   Explicit is better than magic, and using pre_execute to run custom code to skip feels like magic to me. This whole feature just adds a lot of scope for a lot of "accidental complexity" in both Airflow itself, and in the DAGs people write.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916426991


    I think I even discussed it with @malthe that time based decision will be better done based on AIP-39. So I agree @ashb - this is likely not a good use of those pre/post_hooks. see https://github.com/apache/airflow/pull/17545 discussion . 
   
   I agree timetable will be WAY better way of handling the "skip_unless"<CRON>" "  ( and honestly I have not paid attention at the example in PR "description"  - more at the implementation). And explicit better and implicit surely - I would never encourage a "logical change" in processing the DAG based on those. I think we are aligned here.
   
   But the cross-cutting concerns which are not time-related, but environment-related, I still think there is a good use of it. I actually spent last ~ month or so, mostly tuned to listen, analyse respond and help users in slack/stackoverflow/github discussion and this has been explicitly asked at least once (I think this is where the idea was born that it might make sense0: https://apache-airflow.slack.com/archives/CCR6P6JRL/p1629705521304600 but I also personally think that better support for more "enterprise'y" setup where you have "mostly the same but subtly different" environment and "same dags but behaving slightly differently" is something that is really useful. 
   
   And especially when you add the fact that a lot of people compose their DAGs from existing operators without creating custom ones, adding an option to add "common" environmental behaviour to a number of different operators sounds appealing.
   
   And just one more comment - this is just a bit easier way of doing something that is entirely possible today, so while I understand we might not want to "promote" it, just having pre/post hooks is encouraging similar patterns. Maybe we should rather get rid of those pre/post methods if we want to go 'clean" as it promotes even "worse" patterns. For me the pre/post executed looked entirely unnatural and useless to be honest - up until we added the possibility of overriding those (which is the cross-cutting-concern I am mentioning), If anything, I think it makes it much cleaner and easier to manage.
   
   (Pseudo code of course)
   
   Before:
   
   ```
   def pre_hook():
        raise AirflowSkipException() if DEV
   
   def MyCustomOperatorA(OperatorA):
        def pre_execute():
              pre_hook()
   
   def MyCustomOperatorB(OperatorB):
        def pre_execute():
              pre_hook()
   
   with DAG() as dag:.
       a = MyCustomOperatorA()
       b =  MyCustomOperatorB() 
   
       a >> b
   ```
   
   After:
   
   ```
   def pre_hook():
        raise AirflowSkipException() if DEV
   
   with DAG() as dag:.
       a = OperatorA(pre_execute = pre_hook)
       b =  OperatorB(pre_execute = pre_hook)
       a >> b 
   ```
   
   But I might of course be biased, I usually think more from "development" point of view - so not "how easy  and clean it is to run it in produtiton" but "how easy it is to develop and test it in the dev/test/staging env before it gets into PROD". I just see DAG development as a process rather than end-goal. This might be very biased and maybe it can be achieved in a simpler way :). 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916426991


    I think I even discussed it with @malthe that time based decision will be better done based on AIP-39. So I agree @ashb - this is likely not a good use of those pre/post_hooks. see https://github.com/apache/airflow/pull/17545 discussion . I
   
   agree timetable will be WAY better way of handling the "skip_unless"<CRON>" "  ( and honestly I have not paid attention at the example in PR "description"  - more at the implementation). And explicit better and implicit surely - I would never encourage a "logical change" in processing the DAG based on those. I think we are aligned here.
   
   But the cross-cutting concerns which are not time-related, but environment-related, I still think there is a good use of it. I actually spent last ~ month or so, mostly tuned to listen, analyse respond and help users in slack/stackoverflow/github discussion and this has been explicitly asked at least once (I think this is where the idea was born that it might make sense0: https://apache-airflow.slack.com/archives/CCR6P6JRL/p1629705521304600 but I also personally think that better support for more "enterprise'y" setup where you have "mostly the same but subtly different" environment and "same dags but behaving slightly differently" is something that is really useful. 
   
   And especially when you add the fact that a lot of people compose their DAGs from existing operators without creating custom ones, adding an option to add "common" environmental behaviour to a number of different operators sounds appealing.
   
   And just one more comment - this is just a bit easier way of doing something that is entirely possible today, so while I understand we might not want to "promote" it, just having pre/post hooks is encouraging similar patterns. Maybe we should rather get rid of those pre/post hooks if we want to go 'clean" as it promotes even "worse" patterns. For me the pre/post executed looked entirely unnatural and useless to be honest - up until we added the possibility of overriding those (which is the cross-cutting-concern I am mentioning), If anything, I think it makes it much cleaner and easier to manage.
   
   (Pseudo code of course)
   
   Before:
   
   ```
   def pre_hook():
        raise AirflowSkipException() if DEV
   
   def MyCustomOperatorA(OperatorA):
        def pre_execute():
              pre_hook()
   
   def MyCustomOperatorB(OperatorB):
        def pre_execute():
              pre_hook()
   
   with DAG() as dag:.
       a = MyCustomOperatorA()
       b =  MyCustomOperatorB() 
   
       a >> b
   ```
   
   After:
   
   ```
   def pre_hook():
        raise AirflowSkipException() if DEV
   
   with DAG() as dag:.
       a = OperatorA(pre_execute = pre_hook)
       b =  OperatorB(pre_execute = pre_hook)
       a >> b 
   ```
   
   But I might of course be biased, I usually think more from "development" point of view - so not "how easy  and clean it is to run it in produtiton" but "how easy it is to develop and test it in the dev/test/staging env before it gets into PROD". I just see DAG development as a process rather than end-goal. This might be very biased and maybe it can be achieved in a simpler way :). 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-904453307


   The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-903721414


   These two new fields you add don't need to be serialized (and callables can't generally be anyway) -- the general rule is that things needed by the Scheduler should be serialized, and that test was there to make people think about the change.
   
   In this case, since the scheduler doesn't care about these fields they should be added to the ignore list in the test.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-904457121


   Hey @malthe  - I think this one, will be very powerful. After some discussions on Slack recently, the potential of those overrideable hooks is very interesting for a number of cases :). Glad we got it in for 2.2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb edited a comment on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
ashb edited a comment on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916396684


   Forgive me, it's late and been a long day, so I may not be as lucid as I'd hope.
   
   My main concern here is about being able to reason about what a DAG will do. By adding the ability to add arbitrary pre_execute code before any operator in a DAG we end up in a world where it is very hard to look at a DAG and understand what it's going to do. 
   
   > So in this case, we can't start processing the data before we know it's come in (let's assume that this is entirely based on time of day, you can't "sense" it).
   
   I dispute the 'you can't "sense" it'. Strongly.  And the processing based on timing along is the worst possible idea -- removing arbitray time delays between tasks was one of the main reasons that Airflow has dependencies between tasks.
   
   The evolution of data processing worfklow often goes:
   
    - Oh, I've only got one thing to run, I can put it on cron
    - Oh and a second one, but its unrelated to the first, I can cron that to
    - Now I want to combine those two outputs, it's okay I'll just delay it by an hour.
    
   That approach will work for months. Right up until you hit an inflection point (more users, more processing) and then suddenly your entire pipeline is in an inconsistent state (maybe you combined data from two different days. Maybe you might not notice it for a month. This is not hyperbole, but lived experience.)
   
   As for "you can't sense it": Either it's a file on disk/s3/blob store, or a table in a DB, but if you are about to have an operator process it (i.e. read it or copy it), then you can, by definition, sense if it's there or not. 
   
   So I view the "delay a task X minutes to allow data to be processed" a **huge** anti-patern. Airflow works best when it is deterministic. Doing this makes it not.
   
   To the "skip expensive operation if dev": I've not seen anyone ask for that -- read/write  to different bucket in different envs plenty of time, but never skip an operation entirely based on env (cos if you've skipped one step, you have to skip the entire "branch" too.) I had a quick search on https://apache-airflow.slack-archives.org and couldn't find it -- you might have a better idea what to search for (it uses postgress full text search so the stemming might be a bit simplistic)
   
   No, there's nothing I have planned in the AIPs I hinted at in my keynote (most of them are just ideas anyway at this stage)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916788361


   I'd like to get 2.2 release train started, and in order to not delay that unnecessarily how about we mark this feature as experimental -- that way if we decide we _don't_ want it for any reason we can remove it before 3.0 (as otherwise it's removeal would be a breaking change under SemVer)
   
   Thoughts on this interim approach @malthe @potiuk ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916379253


   One of the good cases is to distinguish between DEV/PROD environments. 
   
   For example some of the operators  could be skipped on DEV environments and you might want to add single method throwing "Ski[p" exception in the DEV environment. 
   
   This is mostly a feature that promotes good engineering practices such as being able to run the same code on different environments, wtihout changing the DAGs or adding artifficial logic in the DAG structure, to handle different execution environments. 
   
   I think this is the main case (often raised by our users in Slack). If we do not have hooks, the users would have to develop custom operators of different kind to handle different behaviours. Having the configurable hooks you can deploy the same DAGs for different environments without adding complex logic or having to add custom operators.
   
   Many of our users (and DAG writers) are Data scientists. You could potentially have the same effect by extending N operators and adding pre/post hooks there but for those users classes and objects are "alien" - they think functional and adding pre/post hooks is so much easier, especially that you can much more easliy have one common function that you pass to a number of different, completley unrelated operators. This is a classic "cross-cutting-concern" implementation IMHO.
   
   This is at least what has been my motivation when guiding and approving this one.
   
   Any other idea how to approach it? I am not against other solutions, and I would like to hear what anti-patterns it promotes (in the context I described). Could you tell @ash what antipattern is in play here, as I am not sure  I see it ?
   
   Of course if it somehow makes other parts (non-completed AIP's especially) much more difficult to implement. we can change the way it is implemented, but I think the cabability of making "cross-cutting-concern" - for example tied to an environment where the DAG is executed is a powerful feature that might make the deployment and testing DAGs at scale much easier. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] malthe commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
malthe commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-916381389


   @ashb a real example could be two data sources coming in at different times, each one needs some post-processing and it makes sense to do this immediately – because ultimately, both are used to form some output that should be available as soon as possible.
   
   So in this case, we can't start processing the data before we know it's come in (let's assume that this is entirely based on time of day, you can't "sense" it). I think it's a fairly normal situation – and at least, in the workflows I'm seeing, the pattern is used all of the place.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] JavierLopezT commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
JavierLopezT commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-920716223


   I think this is a great feature. It's what I wanted to achieve when I opened https://github.com/apache/airflow/issues/16038, but probably I miss explained myself.
   
   I use custom SnowflakeOperator and S3ToSnowflakeOperator, and it would be great if we can't avoid use them and just have the official operators with pre and post execute callbacks. For instance, our S3ToSnowflake has an argument called _previous_queries_, that will execute a query before the copy for creating a table. Then, we have functions for moving and/or delete s3 files and finally, making a constraint keys check. All those things happen either before or after the code of the current operator, so here is another use case for this functionality. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] malthe commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
malthe commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-920742851


   I don’t think that is a good use-case for pre/post-execute because those
   tasks could easily (and should) be done using regular operator instances
   and dependencies.
   
   tor. 16. sep. 2021 kl. 10.55 skrev JavierLopezT ***@***.***>:
   
   > I think this is a great feature. It's what I wanted to achieve when I
   > opened #16038 <https://github.com/apache/airflow/issues/16038>, but
   > probably I miss explained myself.
   >
   > I use custom SnowflakeOperator and S3ToSnowflakeOperator, and it would be
   > great if we can't avoid use them and just have the official operators with
   > pre and post execute callbacks. For instance, our S3ToSnowflake has an
   > argument called *previous_queries*, that will execute a query before the
   > copy for creating a table. Then, we have functions for moving and/or delete
   > s3 files and finally, making a constraint keys check. All those things
   > happen either before or after the code of the current operator, so here is
   > another use case for this functionality.
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > <https://github.com/apache/airflow/pull/17576#issuecomment-920716223>, or
   > unsubscribe
   > <https://github.com/notifications/unsubscribe-auth/AAAGOJM5B5L6CQ2OV2UMU2DUCGWGLANCNFSM5CBDJKZA>
   > .
   > Triage notifications on the go with GitHub Mobile for iOS
   > <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
   > or Android
   > <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
   >
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk closed pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
potiuk closed pull request #17576:
URL: https://github.com/apache/airflow/pull/17576


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on pull request #17576: Add pre/post execution hooks

Posted by GitBox <gi...@apache.org>.
uranusjr commented on pull request #17576:
URL: https://github.com/apache/airflow/pull/17576#issuecomment-904360286


   Helm chart seems to be failing with a network error. Not related.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org