Posted to commits@airflow.apache.org by "potiuk (via GitHub)" <gi...@apache.org> on 2023/03/07 23:02:37 UTC

[GitHub] [airflow] potiuk commented on pull request #29815: Add retry to the scheduler loop to protect against DB hiccups

potiuk commented on PR #29815:
URL: https://github.com/apache/airflow/pull/29815#issuecomment-1458999232

   > When each `create_session` block exits, it calls `session.commit()` and writes to the database. So if a database failure happens during lines 226–232, the entire block would be restarted and you would end up writing to the database twice without calling `heartbeat_callback` correctly. I don’t know if it would be problematic, but it is awkward. Ideally I would imagine the two `create_session` blocks should be retried separately instead.
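   The double-write hazard described in the quote can be reproduced with a minimal sketch. Note the `create_session` below is a simplified stand-in (commit on clean exit), not Airflow's real helper, and the table and retry loop are invented for illustration:

```python
import sqlite3
from contextlib import contextmanager

# Simplified stand-in for Airflow's create_session helper: it commits
# on clean exit, like the real one, but is not the actual API.
@contextmanager
def create_session(conn):
    yield conn
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE job_heartbeat (beat INTEGER)")
conn.commit()

attempts = {"n": 0}

def heartbeat_block():
    """One monolithic block: commit a heartbeat, then do more work."""
    with create_session(conn):
        conn.execute("INSERT INTO job_heartbeat VALUES (1)")
    attempts["n"] += 1
    if attempts["n"] == 1:
        # Simulate the DB hiccup happening *after* the first commit.
        raise ConnectionError("simulated DB failure")

# Naively retrying the whole block re-runs the already-committed write.
for _ in range(2):
    try:
        heartbeat_block()
        break
    except ConnectionError:
        pass

count = conn.execute("SELECT COUNT(*) FROM job_heartbeat").fetchone()[0]
print(count)  # 2 -- the heartbeat row was written twice
```

   The first attempt commits the heartbeat and then fails, so the blanket retry commits it again: exactly the "writing the database twice" behaviour described above.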
   
   @uranusjr Yep. This is very awkward. And this awkwardness is precisely what I am going to change during the AIP-44 implementation. There is a (very unfinished) draft commit: https://github.com/apache/airflow/commit/f11f5afbddfe39a9f0e31bc1fc1ba3cc1dfa5394 that depends on merging https://github.com/apache/airflow/pull/29776, which will solve this issue by splitting the "before", "during", and "after" work into three steps (and three DB sessions when the internal DB API is not used; when the internal API is used, there will be no "during" session at all).
   
   > @potiuk DB connection failures happen. Airflow runs in a distributed environment. As much as we want to have reliable infrastructure, that is not possible all the time. The heartbeat is the most frequent operation in the Airflow DB, and its success has a significant impact on task success or failure. What problems do you anticipate with making it more resilient to disruptions?
   
   @bjankie1 Yes. I understand that and sympathise with that statement. But the solution is not to retry a failed transaction without looking at the consequences (as @uranusjr nicely pointed out). Your proposal is a "band-aid" which might create more problems. The right approach to making the system resilient to DB problems is to understand every single DB transaction that happens in the system, deliberately design the behaviour for when that transaction fails, and act appropriately to recover. Retrying a failed transaction without understanding the consequences is a recipe for disaster. And yes, in this case the whole transaction is ... awkward - as also nicely pointed out by @uranusjr - and the right approach is to first fix the behaviour to make more sense (so you can reason about it) and only after that to implement a recovery scenario. That scenario might actually not be needed, because the way I think it will work when I complete the refactor is that those transactions will be split into three (or two, in the case of the internal DB API), so that there will be no need to recover: the problem you observe will not exist.
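   The direction sketched above - each phase as its own, independently retried session - could look roughly like the following. Again, `create_session` is a simplified stand-in for Airflow's helper, and the step names and retry wrapper are invented for illustration:

```python
import sqlite3
from contextlib import contextmanager

# Simplified stand-in for Airflow's create_session (commits on exit);
# the table name and retry helper below are invented for illustration.
@contextmanager
def create_session(conn):
    yield conn
    conn.commit()

def retried(step_fn, retries=3):
    """Retry a single, self-contained DB step on connection errors."""
    for attempt in range(retries):
        try:
            return step_fn()
        except ConnectionError:
            if attempt == retries - 1:
                raise

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE progress (step TEXT)")
conn.commit()

flaky = {"failed_once": False}

def make_step(name):
    def step():
        # Inject one transient failure before "during" ever commits.
        if name == "during" and not flaky["failed_once"]:
            flaky["failed_once"] = True
            raise ConnectionError("simulated hiccup")
        with create_session(conn):
            conn.execute("INSERT INTO progress VALUES (?)", (name,))
    return step

# Each phase is its own session, retried independently: a failure in
# "during" cannot re-commit what "before" already wrote.
for name in ("before", "during", "after"):
    retried(make_step(name))

rows = [r[0] for r in conn.execute("SELECT step FROM progress")]
print(rows)  # ['before', 'during', 'after']
```

   Because a failure in one step never re-runs a commit that an earlier step already made, each row is written exactly once even in the presence of transient failures.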


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org