You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/03/15 09:20:44 UTC

[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799257444

Hey @ashb @kaxil -> The #14351 created a good opportunity to test some race conditions for some of the Flaky tests in scheduler and I would like to fix those flaky tests before I merge the parallel change.

I extracted this change out of the 'parallel run change' those were my attempts to make the tests non-flaky.

The attempts helped - I managed to run successfully several test-passes without the problems - both in parallell self-hosted instances as well as in Public Runners, but I am not at all sure whether:

a) this is a good fix
b) whether I am not masking some real problem that might appear under heavy load in real Airflow production system

Since you are the experts in this area, maybe you can try to answer those questions?

I captured example flaky tests in <#14778 #14773 #14772 (you can see stack traces there).

I have several hypothesis why those flaky tests could appear. Maybe you can review/check/veriy my changes and the hypothesis:

* Hypothesis 1): There is a delay between scheduler actually parsing the dag and saving it into SerializedDagModel. Under heavy load it might get slightly delayed and hence the row is not saved there. My changes are assuming that this hypothesis is valid.

* Hypothesis 2): There is another background process resulting from other tests that cleans-up the SerializedDagModel and while previous tests are already finishied (they are run sequentially within the Core test type) some of the background proces is still running and removes the model. If that's the case, then this is a wrong fix. We need to find out a away to get rid of the cross-tests side effects between tests and find a way to kill all the background threads in tearDown/fixture of the tests.

* Hypothesis 3): There is a delay in MySQL/Postgres DB between the actuall commit/flush and the time when the subsequent transaction can read the data. This is - to be honest most interesting one and if that hypothesis is true, my fix is not good either. We should find a way to be sure that the change is actually writtten to the database and available for other sessions to make the tests stable.

I would love to hear your opinion @kaxil @ash - any other hypotheses? What do you think about those hypotheses I have and the fix?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org