Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2018/09/18 00:10:26 UTC

[GitHub] afernandez edited a comment on issue #3547: [AIRFLOW-2659] Improve Robustness of Operators in Airflow during Infra Outages

URL: https://github.com/apache/incubator-airflow/pull/3547#issuecomment-422210823

@Fokko My apologies for replying two months late (I was working on other high-priority projects and am now returning to Airflow).
Good question. The primary reason is that retries in Airflow are mainly meant to handle transient errors, where 3-5 retries (or perhaps a 5-minute window) suffice. This PR tries to address larger infrastructure outages that can last several hours.

A user may have a legitimate reason to retry only 3 times (say, a particular service is flaky under really high load). Keeping the retry count low for transient errors provides enough robustness against flaky services, without being so high that it completely masks unreliable services.

The solution I'm proposing tries to be more intelligent by applying business logic specific to each hook.
If the failure is indeed a transient error, then retry according to the existing Airflow logic; if it's a complete infrastructure outage, then perhaps keep retrying for 2-4 hours. Luckily, services like Hive, Presto, Spark, etc., can provide enough context to make this determination.
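
To make the idea concrete, here is a rough sketch of that decision in plain Python. It is not the code in this PR and does not use any Airflow API: `run_with_outage_retry`, `InfraOutageError`, and the `is_outage` callback are hypothetical names, and the 2-hour window and 60-second poll interval are only illustrative defaults. The assumption is that each hook can supply an `is_outage` check based on the error context its service returns.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InfraOutageError(Exception):
    """Hypothetical marker a hook could raise when it detects a full outage."""


def run_with_outage_retry(
    action: Callable[[], T],
    is_outage: Callable[[Exception], bool],
    outage_window_s: float = 2 * 60 * 60,   # illustrative: keep trying for 2 hours
    poll_interval_s: float = 60.0,          # illustrative: wait 1 minute between tries
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Run `action`, retrying for a long window only during an outage.

    Errors that `is_outage` does not recognise are re-raised immediately,
    so the scheduler's ordinary short retry logic still handles them.
    """
    deadline: Optional[float] = None
    while True:
        try:
            return action()
        except Exception as exc:
            if not is_outage(exc):
                # Transient or unknown error: leave it to the normal retries.
                raise
            now = clock()
            if deadline is None:
                # The window starts at the first outage error we see.
                deadline = now + outage_window_s
            if now >= deadline:
                # The outage outlasted the window: give up and fail.
                raise
            sleep(poll_interval_s)
```

A hook would call it along the lines of `run_with_outage_retry(run_query, lambda e: isinstance(e, InfraOutageError))`, where `run_query` is whatever zero-argument callable performs the work. Passing `sleep` and `clock` as parameters is only there so the behaviour can be exercised without actually waiting.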
