Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/01/24 17:04:51 UTC

[GitHub] [airflow] alex-astronomer opened a new issue #21072: manage_sla firing notifications for the same sla miss instances repeatedly

alex-astronomer opened a new issue #21072:
URL: https://github.com/apache/airflow/issues/21072


   ### Apache Airflow version
   
   2.1.4
   
   ### What happened
   
   SLAMiss is firing notifications (a Slack notification, as defined by the sla_miss_callback), but every time it calls the sla_miss_callback it sends notifications for the same set of tasks.  It seems as though the notification_sent flag in the database is never set to true.  This happens when a large number of SLA misses need to be processed at the same time.
   
   The use case is backfilling a frequently running DAG from a start date about one month in the past.  As a result, around 14k SLA misses need to be processed all at once.
   
   ### What you expected to happen
   
   I expected sla_miss_callback to be called and, by the end of managing the SLAs, the misses to no longer need processing.  Each SLA miss should be handled once and not picked up again on later passes.
   
   We found the root cause for this issue.  The DAGFileProcessor times out before the transactions that set notification_sent = True for the SLAs are committed to the database.  This is a somewhat odd "in-between" case: the timeout is long enough for the sla_miss_callback to run, but not long enough for all of the flags to be changed in the database.  As a result, the same SLAs are processed over and over again every time we manage SLAs.
   
   The offending line in the code base is the commit call at the end of manage_slas.  When we try to commit the changes to all 14k records, the DAGFileProcessor times out in the middle of that line.
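
One way to limit the damage from such a timeout, sketched here under assumptions, is to commit the flag updates in small batches rather than in a single commit at the end, so that whatever finished before the timeout stays committed. The sketch uses Python's built-in sqlite3 with a stand-in sla_miss table. The names make_demo_db, count_pending and mark_notified_in_batches, the batch size, and the simulated timeout are all hypothetical and are not Airflow code.

```python
import sqlite3


def make_demo_db(n_rows):
    """Create an in-memory stand-in for the sla_miss table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sla_miss ("
        "id INTEGER PRIMARY KEY, "
        "notification_sent INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO sla_miss (id) VALUES (?)", [(i,) for i in range(n_rows)]
    )
    conn.commit()
    return conn


def count_pending(conn):
    """Number of rows still waiting for notification_sent to be set."""
    row = conn.execute(
        "SELECT COUNT(*) FROM sla_miss WHERE notification_sent = 0"
    ).fetchone()
    return row[0]


def mark_notified_in_batches(conn, ids, batch_size=500, fail_on_batch=None):
    """Set notification_sent in batches, committing after each batch.

    fail_on_batch (hypothetical, for the demo only) simulates a timeout that
    hits before the commit of that batch (0-based) completes.
    """
    committed = 0
    for batch_no, start in enumerate(range(0, len(ids), batch_size)):
        batch = ids[start:start + batch_size]
        conn.executemany(
            "UPDATE sla_miss SET notification_sent = 1 WHERE id = ?",
            [(i,) for i in batch],
        )
        if batch_no == fail_on_batch:
            # The in-flight batch is lost; earlier batches stay committed.
            conn.rollback()
            raise TimeoutError("simulated processor timeout")
        conn.commit()
        committed += len(batch)
    return committed
```

With 14,000 rows and a batch size of 1,000, a simulated timeout on the fourth batch leaves the first 3,000 rows flagged, so the next pass has 11,000 rows left to handle instead of all 14,000.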
   
   ### How to reproduce
   
   Generate many SLA misses all at once.  This can be triggered by setting the start date for a DAG in the past and scheduling it to run frequently.  Then, once manage_slas is called, all of the SLA misses are processed at the same time, causing a pile-up in the system.
   
   After that, the timeout has to be tuned just right so that sla_miss_callback runs but the transactions are not committed to the database.  The exact value depends on the system the reproduction is running on.
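
To get a feel for how many SLA misses a backfill like this can generate, the arithmetic can be sketched as below. The 3-minute interval is an assumption, chosen only because it lands near the ~14k figure over 30 days; the DAG's actual schedule is not stated in this report. expected_runs is a hypothetical helper, and it assumes every run in the window misses its SLA.

```python
from datetime import datetime, timedelta


def expected_runs(start, now, interval):
    """Count whole schedule intervals between start and now."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    # timedelta // timedelta yields an int; clamp to 0 if start is in the future.
    return max((now - start) // interval, 0)


now = datetime(2022, 1, 24)
start = now - timedelta(days=30)
runs = expected_runs(start, now, timedelta(minutes=3))
print(runs)  # 14400
```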
   
   ### Operating System
   
   macOS Big Sur 11.3.1
   
   ### Versions of Apache Airflow Providers
   
   n/a
   
   ### Deployment
   
   Astronomer
   
   ### Deployment details
   
   _No response_
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #21072: manage_sla firing notifications for the same sla miss instances repeatedly

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #21072:
URL: https://github.com/apache/airflow/issues/21072#issuecomment-1073907847


   I think the SLA mechanism is generally up for rewriting; it has many more problems than that. I would say this needs to be part of the rewrite (but there is no timeline for that).





[GitHub] [airflow] bskim45 commented on issue #21072: manage_sla firing notifications for the same sla miss instances repeatedly

Posted by GitBox <gi...@apache.org>.
bskim45 commented on issue #21072:
URL: https://github.com/apache/airflow/issues/21072#issuecomment-1070181632


   I'm experiencing a similar, somewhat related issue. When the `sla` argument is provided but neither an SLA miss email is sent nor an `sla_miss_callback` is specified, SlaMiss entries pile up in the `sla_miss` table with `notification_sent=false`. This causes calls to `DAGFileProcessor.manage_slas` to time out during callback processing.
   
   Quick example:
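   ```python
   import datetime

   # Import paths as of Airflow 2.1.x
   from airflow import DAG
   from airflow.operators.dummy import DummyOperator
   from airflow.utils.dates import days_ago

   # No sla_miss_callback and no SLA miss email is configured for this DAG
   with DAG(
       dag_id="example_dag",
       schedule_interval='@hourly',
       start_date=days_ago(1),
       catchup=False,
   ) as dag:
       dummy_task = DummyOperator(
           task_id='dummy_task',
           sla=datetime.timedelta(hours=18),
       )
   ```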

