Posted to commits@airflow.apache.org by "potiuk (via GitHub)" <gi...@apache.org> on 2023/07/23 07:40:32 UTC

[GitHub] [airflow] potiuk commented on pull request #32780: Quarantine test_backfill_integration in dask executor

potiuk commented on PR #32780:
URL: https://github.com/apache/airflow/pull/32780#issuecomment-1646770573

   Hey here. We have this nasty "dask executor" test that has been extremely flaky recently, failing with a database deadlock.
   
   https://github.com/apache/airflow/issues/32778
   
   It will need some detailed investigation. It would be great if someone could take a really deep look at the problem, as it is pretty nasty and next to impossible to reproduce locally (I have not managed to do it yet).
   
   Next week I am travelling around Slovakia with my wife (not vacation yet, but as close to it as it gets; the real one comes in Halifax in September/October, after Summit and before the Community over Code conference). So I will not have time to take a look at this during that week. This is also a note for those who might be counting on my usual responsiveness in the coming week: "DON'T :)".
   
   Next week my top priority will be making sure that the provider separation/configuration work does not delay Airflow 2.7. I think all the PRs needed for that are already up and "green-ish" (with the exception that almost always, when I run with "full tests needed", one of the Postgres tests fails with this nasty deadlock).
   
   So instead of investigating and trying to fix that one, I decided to isolate the test and make it less disruptive. To "isolate" the test, I revived our old "quarantine" mechanism, which we used to isolate these kinds of tests until they got proper treatment. It was useful when we had quite a number of them about 2 years ago, but then we had a streak where we investigated and solved almost all of the "regular" flaky tests, so the mechanism was somewhat forgotten. Now is the right time to bring it back and polish it a bit.
   
   I refreshed it a bit and made it more complete: the "quarantined" tests now run sequentially for all 4 backends we have, and only on self-hosted runners, so they will only be visible to committers and in main builds. @o-nikolas, we recently discussed the tests that rely on "timing": this is exactly the mechanism we can use if we see any of our timing tests become flaky. We mark the test with `@pytest.mark.quarantined` and it will run in a perfect environment:
   
   * on self-hosted runners
   * in isolation (everything runs sequentially)
   * even if it fails, you will see it in the logs, but it will not fail the whole build, will not block constraints from updating, and generally will not disrupt the "green" status of builds (which is good if we want to learn to rely on "green" being the indicator of "can be merged").
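
As a rough illustration of the marking step only, here is a minimal sketch of a test carrying the `quarantined` marker. The test name and body are hypothetical placeholders, not the real `test_backfill_integration`, and the sketch says nothing about how Airflow's CI actually selects or schedules these tests.

```python
import pytest


# Hypothetical placeholder test: the name and body are illustrative only.
# The marker itself is the one named in this comment.
@pytest.mark.quarantined
def test_flaky_backfill_sketch():
    # A real quarantined test would exercise the flaky behaviour here.
    result = sum([1, 2, 3])
    assert result == 6


# pytest records applied markers on the function's `pytestmark` attribute,
# which is what marker-based selection relies on.
marker_names = [mark.name for mark in test_flaky_backfill_sketch.pytestmark]
```

With plain pytest, `pytest -m quarantined` would select only tests carrying this marker and `pytest -m "not quarantined"` would skip them; whether the CI scripts use exactly these invocations is an assumption here. A custom marker also normally needs to be registered (for example under `markers` in the pytest configuration) to avoid unknown-marker warnings.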
   
   Most likely, when the tests run in isolation, they will **not** fail: such flaky tests are usually caused by side effects or by running in parallel in a "busy" environment. We might not even have to fix it eventually; maybe just running in isolation is the final solution. But I think this one might indicate a **REAL** issue we have with deadlocks when backfilling, so it might be worth investigating and fixing eventually.
   
   This one is difficult to reproduce locally but relatively "easy" to reproduce for a committer whose tests run on self-hosted runners:
   
   * remove quarantined marker
   * make a PR with "full tests needed" 
   * likely one of the jobs will fail; if not, re-run all the jobs and it will fail rather quickly
   
   So if you **think** you found a fix, closing/reopening a PR with the marker removed and "full tests needed" set, and seeing all jobs pass, should be a good sign that things are fixed.
   

