Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/03/15 09:06:58 UTC

[GitHub] [airflow] potiuk opened a new pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk opened a new pull request #14792:
URL: https://github.com/apache/airflow/pull/14792


   The parallel tests from #14531 created a good opportunity to
   reproduce some of the race conditions that cause some of the
   scheduler job tests to be flaky.
   
   This change is an attempt to fix three of the flaky tests
   there by adding serialized dag writing in the place which is
   likely to suffer from a race condition.
   
   Fixes: #14778 #14773 #14772
   
   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk edited a comment on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799767732


   Btw. If we add `__del__`, I believe #14805 is not needed and should not be merged? Do I read it right?





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799835271


   When I run it locally (and change the expected number), I get exactly 10 calls for the serialized_dag as expected.
   
   So somewhere, something executes one more serialized_dag get() when the tests are run in parallel. Any guess? Is there a possibility of re-parsing/re-retrieving the serialized_dag after some time? (we have max_run set to 1). I doubt it is about some side effect from other tests (however, it is possible as well).
   
   ```
   E           AssertionError: The expected number of db queries is 196. The current number is 195.
   E           
   E           Recorded query locations:
   E           	test_scheduler_job.py:test_execute_queries_count_with_harvested_dags>scheduler_job.py:_run_scheduler_loop>scheduler_job.py:adopt_or_reset_orphaned_tasks:1853:	1
   E           	test_scheduler_job.py:test_execute_queries_count_with_harvested_dags>scheduler_job.py:_run_scheduler_loop>scheduler_job.py:adopt_or_reset_orphaned_tasks:1885:	1
   E           	scheduler_job.py:_run_scheduler_loop>scheduler_job.py:adopt_or_reset_orphaned_tasks>taskinstance.py:__repr__:899:	100
   E           	test_scheduler_job.py:test_execute_queries_count_with_harvested_dags>scheduler_job.py:_run_scheduler_loop>scheduler_job.py:adopt_or_reset_orphaned_tasks:1912:	1
   E           	scheduler_job.py:_do_scheduling>retries.py:wrapped_function>scheduler_job.py:_create_dagruns_for_dags:1592:	1
   E           	retries.py:wrapped_function>scheduler_job.py:_create_dagruns_for_dags>scheduler_job.py:_create_dag_runs:1608:	1
   E           	dagbag.py:get_dag>dagbag.py:_add_dag_from_db>serialized_dag.py:get:229:	10
   E           	scheduler_job.py:_create_dagruns_for_dags>scheduler_job.py:_create_dag_runs>dag.py:create_dagrun:1781:	10
   E           	dag.py:create_dagrun>dagrun.py:verify_integrity>dagrun.py:get_task_instances:329:	10
   E           	scheduler_job.py:_create_dag_runs>dag.py:create_dagrun>dagrun.py:verify_integrity:674:	10
   E           	scheduler_job.py:_create_dagruns_for_dags>scheduler_job.py:_create_dag_runs>scheduler_job.py:_update_dag_next_dagruns:1659:	1
   E           	scheduler_job.py:_do_scheduling>retries.py:wrapped_function>scheduler_job.py:_create_dagruns_for_dags:1595:	1
   E           	test_scheduler_job.py:test_execute_queries_count_with_harvested_dags>scheduler_job.py:_run_scheduler_loop>scheduler_job.py:_do_scheduling:1533:	1
   E           	scheduler_job.py:_schedule_dag_run>scheduler_job.py:_verify_integrity_if_dag_changed>serialized_dag.py:get_latest_version_hash:285:	10
   E           	dagrun.py:update_state>dagrun.py:task_instance_scheduling_decisions>dagrun.py:get_task_instances:329:	10
   E           	scheduler_job.py:_do_scheduling>scheduler_job.py:_schedule_dag_run>dagrun.py:update_state:480:	10
   E           	scheduler_job.py:_do_scheduling>scheduler_job.py:_schedule_dag_run>dagrun.py:schedule_tis:761:	10
   E           	test_scheduler_job.py:test_execute_queries_count_with_harvested_dags>scheduler_job.py:_run_scheduler_loop>scheduler_job.py:_do_scheduling:1549:	1
   E           	scheduler_job.py:_critical_section_execute_task_instances>scheduler_job.py:_executable_task_instances_to_queued>pool.py:slots_stats:107:	1
   E           	scheduler_job.py:_critical_section_execute_task_instances>scheduler_job.py:_executable_task_instances_to_queued>pool.py:slots_stats:111:	1
   E           	scheduler_job.py:_do_scheduling>scheduler_job.py:_critical_section_execute_task_instances>scheduler_job.py:_executable_task_instances_to_queued:949:	1
   E           	scheduler_job.py:_critical_section_execute_task_instances>scheduler_job.py:_executable_task_instances_to_queued><string>:<lambda>:1:	1
   E           	scheduler_job.py:_critical_section_execute_task_instances>scheduler_job.py:_executable_task_instances_to_queued>scheduler_job.py:__get_concurrency_maps:895:	1
   E           	scheduler_job.py:_do_scheduling>scheduler_job.py:_critical_section_execute_task_instances>scheduler_job.py:_executable_task_instances_to_queued:1090:	1
   ```
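
The query-location dump above comes from a counting harness in Airflow's test suite. As a generic illustration of the technique (not Airflow's actual helper), statement counting can be sketched with a SQLAlchemy event listener:

```python
from contextlib import contextmanager

from sqlalchemy import create_engine, event, text


@contextmanager
def count_queries(engine):
    """Count every statement executed on the engine inside the block."""
    counter = {"count": 0}

    def after_cursor_execute(conn, cursor, statement, parameters, context, executemany):
        counter["count"] += 1

    event.listen(engine, "after_cursor_execute", after_cursor_execute)
    try:
        yield counter
    finally:
        event.remove(engine, "after_cursor_execute", after_cursor_execute)


engine = create_engine("sqlite://")  # in-memory DB for the demo
with count_queries(engine) as counter:
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
        conn.execute(text("SELECT 2"))
print(counter["count"])
```

A test that asserts an exact count this way is inherently sensitive to any background process touching the same database, which is exactly the flakiness being chased in this thread.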





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799816772


   Also, the number of queries seems to be affected by the changes, but there is likely something else that affects it slightly:
   
   https://github.com/apache/airflow/runs/2116031674?check_suite_focus=true#step:6:10600
   
   We got 196 queries here instead of 195, and it is puzzling why:
   
   
   ```
   E           AssertionError: The expected number of db queries is 195. The current number is 196.
     E           
     E           Recorded query locations:
     E           	test_scheduler_job.py:test_execute_queries_count_with_harvested_dags>scheduler_job.py:_run_scheduler_loop>scheduler_job.py:adopt_or_reset_orphaned_tasks:1839:	1
     E           	test_scheduler_job.py:test_execute_queries_count_with_harvested_dags>scheduler_job.py:_run_scheduler_loop>scheduler_job.py:adopt_or_reset_orphaned_tasks:1867:	1
     E           	scheduler_job.py:_run_scheduler_loop>scheduler_job.py:adopt_or_reset_orphaned_tasks>taskinstance.py:__repr__:899:	100
     E           	test_scheduler_job.py:test_execute_queries_count_with_harvested_dags>scheduler_job.py:_run_scheduler_loop>scheduler_job.py:adopt_or_reset_orphaned_tasks:1893:	1
     E           	scheduler_job.py:_do_scheduling>retries.py:wrapped_function>scheduler_job.py:_create_dagruns_for_dags:1573:	1
     E           	retries.py:wrapped_function>scheduler_job.py:_create_dagruns_for_dags>scheduler_job.py:_create_dag_runs:1592:	1
     E           	dagbag.py:get_dag>dagbag.py:_add_dag_from_db>serialized_dag.py:get:229:	11
     E           	scheduler_job.py:_create_dagruns_for_dags>scheduler_job.py:_create_dag_runs>dag.py:create_dagrun:1781:	10
     E           	dag.py:create_dagrun>dagrun.py:verify_integrity>dagrun.py:get_task_instances:329:	10
     E           	scheduler_job.py:_create_dag_runs>dag.py:create_dagrun>dagrun.py:verify_integrity:674:	10
     E           	scheduler_job.py:_create_dagruns_for_dags>scheduler_job.py:_create_dag_runs>scheduler_job.py:_update_dag_next_dagruns:1646:	1
     E           	scheduler_job.py:_do_scheduling>retries.py:wrapped_function>scheduler_job.py:_create_dagruns_for_dags:1576:	1
     E           	test_scheduler_job.py:test_execute_queries_count_with_harvested_dags>scheduler_job.py:_run_scheduler_loop>scheduler_job.py:_do_scheduling:1514:	1
     E           	scheduler_job.py:_schedule_dag_run>scheduler_job.py:_verify_integrity_if_dag_changed>serialized_dag.py:get_latest_version_hash:285:	10
     E           	dagrun.py:update_state>dagrun.py:task_instance_scheduling_decisions>dagrun.py:get_task_instances:329:	10
     E           	scheduler_job.py:_do_scheduling>scheduler_job.py:_schedule_dag_run>dagrun.py:update_state:480:	10
     E           	scheduler_job.py:_do_scheduling>scheduler_job.py:_schedule_dag_run>dagrun.py:schedule_tis:767:	10
     E           	test_scheduler_job.py:test_execute_queries_count_with_harvested_dags>scheduler_job.py:_run_scheduler_loop>scheduler_job.py:_do_scheduling:1530:	1
     E           	scheduler_job.py:_critical_section_execute_task_instances>scheduler_job.py:_executable_task_instances_to_queued>pool.py:slots_stats:107:	1
     E           	scheduler_job.py:_critical_section_execute_task_instances>scheduler_job.py:_executable_task_instances_to_queued>pool.py:slots_stats:114:	1
     E           	scheduler_job.py:_do_scheduling>scheduler_job.py:_critical_section_execute_task_instances>scheduler_job.py:_executable_task_instances_to_queued:934:	1
     E           	scheduler_job.py:_critical_section_execute_task_instances>scheduler_job.py:_executable_task_instances_to_queued><string>:<lambda>:1:	1
     E           	scheduler_job.py:_critical_section_execute_task_instances>scheduler_job.py:_executable_task_instances_to_queued>scheduler_job.py:__get_concurrency_maps:879:	1
     E           	scheduler_job.py:_do_scheduling>scheduler_job.py:_critical_section_execute_task_instances>scheduler_job.py:_executable_task_instances_to_queued:1075:	1
   ```
   
   





[GitHub] [airflow] potiuk edited a comment on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk edited a comment on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799767732


   Btw. If we add `__del__`, I believe #14085 is not needed and should not be merged? Do I read it right?





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799809668


   Yep. The `__del__` method is mostly about reference counting, and as long as it is safe to use in prod, it makes it easier for tests to be automatically finalized.
   
   @ashb I have not added the executor to `__del__` for now - I want to keep this change localized to just making sure that SchedulerJobs are cleaned up when the reference count goes to 0 (and I want to make sure it actually will happen). I keep the old "test" change handy in case we find that the flakiness of tests is - for whatever reason - not handled by the destructor/finalizer. I want to make sure it actually fixes the problem, and when it does, I think adding the executor should be the next step.
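
The reference-counting behaviour discussed here can be illustrated with a toy class (hypothetical, not Airflow's SchedulerJob): in CPython, `__del__` fires as soon as the last reference goes away, unless the object is caught in a reference cycle:

```python
class Job:
    """Toy stand-in for a job object (not Airflow's SchedulerJob)."""

    finalized = []

    def __del__(self):
        # In CPython this runs deterministically when the reference
        # count drops to zero (unless the object sits in a ref cycle).
        Job.finalized.append("cleaned")


job = Job()
del job  # last reference gone -> __del__ fires immediately in CPython
print(Job.finalized)  # ['cleaned']
```

Note that this determinism is a CPython implementation detail; other interpreters may delay finalization until a garbage-collection pass.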
   
   
   





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799758821


   > This can be achieved with a much smaller code change, and in such a way that new tests can't get this wrong, [using `__del__`](https://docs.python.org/3/reference/datamodel.html#object.__del__)
   
   I wanted to minimize the "airflow" code change and focused only on changing the tests. Are you sure it will not have any effects on the airflow code? Destructors are dangerous because they usually change garbage-collection behaviour in live code. Are you sure it is absolutely safe to add the destructor? Do we really want to impact the production code in order to fix the flaky tests? Or do you think it should have been there in the first place?





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799833201


   Yep. Repeated again https://github.com/apache/airflow/pull/14531/checks?check_run_id=2117086179#step:6:9923 
   
   ```
     E           AssertionError: The expected number of db queries is 195. The current number is 196.
   ```





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799615845


   > I think your first hypothesis is most likely -- the parsing is kicked off by a background process, and with the introduction of parallel tests the load average will be higher, so the parsing process will likely be slightly slower to read the DAG off disk and parse it.
   
   I looked at the code and I believe this also means that Hypothesis 2) might come into play. I believe some SchedulerJob-originated processes might still be running when the test finishes, and they can make DB updates that either remove existing data or update it.
    
   I guess we can stabilize the tests not only by executing clean_db() in tearDown, but also by registering all SchedulerJobs and ending them gracefully in tearDown. I believe, for example, this could be the reason for https://github.com/apache/airflow/issues/14773. I will experiment a bit and push a more complete fix.
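
A minimal sketch of that tearDown idea (hypothetical names; `end()` stands in for whatever graceful-shutdown call the job exposes):

```python
import unittest


class SchedulerJobTestCase(unittest.TestCase):
    """Sketch: track every job a test starts so tearDown can end it."""

    def setUp(self):
        self._live_jobs = []

    def _register_job(self, job):
        # Tests would wrap job creation with this helper.
        self._live_jobs.append(job)
        return job

    def tearDown(self):
        # End registered jobs gracefully even if the test body raised;
        # `end()` is a stand-in for the real graceful-shutdown call.
        for job in self._live_jobs:
            try:
                job.end()
            except Exception:
                pass
        self._live_jobs.clear()
```

Because tearDown runs even when the test body fails, no background scheduler process outlives its test to corrupt the shared database.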





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799825155


   I've added a clean_db() in tearDown for this one; it should have no effect (but it is consistent with other test cases).
   
   This line is suspicious:
   
   ```
     E           	dagbag.py:get_dag>dagbag.py:_add_dag_from_db>serialized_dag.py:get:229:	11
   ```
   
   From what I understand, with 10 dags it should only run the serialized_dag get() 10 times, but it was called 11 times.
   





[GitHub] [airflow] potiuk commented on a change in pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk commented on a change in pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#discussion_r594580729



##########
File path: tests/jobs/test_scheduler_job.py
##########
@@ -2822,6 +2824,7 @@ def test_verify_integrity_if_dag_changed(self):
         assert scheduler.dagbag.dags == {'test_verify_integrity_if_dag_changed': dag}
         assert len(scheduler.dagbag.dags.get("test_verify_integrity_if_dag_changed").tasks) == 1
 
+        SerializedDagModel.write_dag(dag=dag)

Review comment:
      I think this one was related to the scheduler processes still running in the background - see #14773. I removed that line now in my tests.







[GitHub] [airflow] ashb commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

ashb commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799775141


   Yes, this finalizer will be safe in production https://github.com/apache/airflow/blob/a639dd364865da7367f342d5721a5f46a7188a29/airflow/cli/commands/scheduler_command.py#L31-L63 - `job` is in scope until the process exits/is just about to exit.
   
   If you add executor to the `__del__`, then yes, 14085 won't be needed (it's still draft anyway)





[GitHub] [airflow] ashb commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

ashb commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799813016


   @potiuk Sounds good. @jedcunningham See the above discussion -- once this is merged can you update your PR to call executor.end from in the finalizer this PR adds?





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799766339


   I mean - I have no problem with adding a destructor to the class to fix the problem. But I am not sure whether there aren't any side effects. All of those changes were pretty automated with IntelliJ and can be reverted easily; I am just thinking whether there are production side effects. But if you confirm there aren't, I am happy to change it to use `__del__`.





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799695743


   Yep. I believe the remnant processes from SchedulerJob would explain the vast majority of the flaky tests we had in the scheduler - for example, changing counts of queries, wrong status of tasks, and the like.
   
   I think I fixed all of them. I will run a few more tests (running the full test suite locally several times), but it looks very much like this.
   
   @ashb I'd love it if you took a look and confirmed my findings/checked whether I fixed it correctly. There is one issue (test_retry_handling_job) left - it is consistently failing, so I marked it as xfail; it likely needs fixing because I think its behaviour has changed now. But I think it should be taken care of separately.
   
   





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-800118478


   I will continue chasing it, trying both the `__del__` solution and explicit cleanup in tearDown (I now have two PRs where I am testing both), but any help is appreciated.
   
   I want to make sure those unstable/side-effect tests are fixed before I merge the parallel test change, because they are much easier to trigger there.
   
   Another problem I found: because the tests are running so much faster, we are much more likely to hit Docker pull limits, so I had to retag (not rebuild!) the images we pull (postgres/mongo etc.) and push them to the apache/airflow Docker registry (the Apache registry is exempted from the pull rate limits). It's a simple but effective way to handle the limits, and we do not have to maintain those images at all - they are just tagged and pushed with the right tag: #14819
   
   
   
   
   





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799781452


   Cool. I will fix it then. Glad that we will finally be able to fix it! It has been a pain in the neck for quite a while.





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799741565


   I ran it 5 times on my PC and it succeeded. Trying to see the benchmark from before the change. Some unrelated stuff failed, so I am closing and reopening to restart the build.





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799759843


   I would strongly prefer to change only test code now - and if we think that the `__del__` change is good for production code, I think it should be a follow-up change with the full test scenario before releasing Airflow. I am really afraid this change might affect the production code, and I am pretty sure it should be extensively tested (unlike a change in test code only).





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-800681024


   Hey @ashb - unfortunately we cannot and should not use the `__del__` method. This is what I was afraid of: SchedulerJob is not an ordinary class - it is an SQLAlchemy-managed class, and (as I was afraid) `__del__` will not run well for it, because SQLAlchemy introduces reference cycles and `__del__` does not work in that case. So far I was only suspecting this, but now I am quite sure of it. This is the warning printed by SQLAlchemy:
   
   ```
     /usr/local/lib/python3.6/site-packages/sqlalchemy/orm/instrumentation.py:93 SAWarning: __del__() method on class <class 'airflow.jobs.scheduler_job.SchedulerJob'> will cause unreachable cycles and memory leaks, as SQLAlchemy instrumentation often creates reference cycles.  Please remove this method.
   ```
   
   It's difficult to argue with this. I had the same feeling when I read how `__del__` and reference counting work and when I saw that SchedulerJob is an ORM class. But now I am 100% sure it was a bad idea.
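
A cycle-safe alternative to `__del__` for such ORM-instrumented classes is `weakref.finalize`, which attaches cleanup externally rather than defining a method on the class; a sketch with a toy class (not Airflow's SchedulerJob):

```python
import weakref


class Job:
    """Toy ORM-style class; defining __del__ on it would trigger the
    SAWarning quoted in the comment, so cleanup is attached externally."""


cleaned = []

job = Job()
# The finalizer holds no strong reference to `job`, so it cannot create
# a reference cycle; it runs at garbage collection or interpreter exit.
weakref.finalize(job, cleaned.append, "job-cleaned")

del job  # last reference gone -> finalizer runs (CPython)
print(cleaned)  # ['job-cleaned']
```

Unlike `__del__`, the callback is guaranteed to run at most once and survives the object being trapped in a cycle, which is exactly the failure mode SQLAlchemy warns about.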
   





[GitHub] [airflow] potiuk closed pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk closed pull request #14792:
URL: https://github.com/apache/airflow/pull/14792


   





[GitHub] [airflow] potiuk edited a comment on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

potiuk edited a comment on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799257444


   Hey @ashb @kaxil -> #14531 created a good opportunity to test some race conditions for some of the flaky tests in the scheduler, and I would like to fix those flaky tests before I merge the parallel change.
   
   I extracted this change out of the 'parallel run' change; these were my attempts to make the tests non-flaky.
   
   The attempts helped - I managed to run several test passes successfully without problems, both on parallel self-hosted instances and on public runners, but I am not at all sure whether:
   
   a) this is a good fix
   b) I am not masking some real problem that might appear under heavy load in a real Airflow production system
   
   Since you are the experts in this area, maybe you can try to answer those questions?
   
   I captured example flaky tests in #14778 #14773 #14772 (you can see stack traces there).
   
   I have several hypotheses why those flaky tests could appear. Maybe you can review/check/verify my changes and the hypotheses:
   
   * Hypothesis 1): There is a delay between the scheduler actually parsing the dag and saving it into SerializedDagModel. Under heavy load this might be slightly delayed, and hence the row is not saved there yet. My changes assume this hypothesis is valid - but maybe the fix could be done better (i.e. waiting for the scheduler to complete).
   
   * Hypothesis 2): There is another background process, resulting from other tests, that cleans up the SerializedDagModel; while previous tests have already finished (they run sequentially within the Core test type), some background process is still running and removes the model. If that's the case, then this is the wrong fix. We need to find a way to get rid of the cross-test side effects and to kill all the background threads in the tearDown/fixture of the tests.
   
   * Hypothesis 3): There is a delay in the MySQL/Postgres DB between the actual commit/flush and the time when a subsequent transaction can read the data. This is, to be honest, the most interesting one, and if this hypothesis is true, my fix is not good either. We should find a way to be sure that the change is actually written to the database and available to other sessions, to make the tests stable.
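
For delays like those in Hypothesis 1) or 3), a generic poll-and-wait helper is the usual remedy. A sketch (the `SerializedDagModel.has_dag` usage in the comment is illustrative only):

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.1):
    """Poll predicate() until it is truthy or the timeout expires;
    returns the final truth value instead of asserting immediately."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return bool(predicate())


# Illustrative usage in a test:
#   assert wait_until(lambda: SerializedDagModel.has_dag(dag.dag_id))
print(wait_until(lambda: True))  # True
```

Polling with a generous timeout tolerates slow writers, while a fixed `assert` taken at one instant is exactly what makes such tests flaky under load.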
   
   
   I would love to hear your opinion @kaxil @ash - any other hypotheses? What do you think about those hypotheses I have and the fix? 
   
   I am happy to test it in my parallel-test change if you have other proposals. Those problems appear often enough to be reproducible there and I would love to wait with merging the parallel change until those flaky scheduler tests are fixed. 
   
   I am also happy to follow a similar step-by-step process for the other tests in the scheduler (there are a few more marked as quarantined), but we can treat them one by one and fix all of them before I merge the parallel tests change.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799768287


   Or should it be modified? I see there are some additional diagnostics in there when executors are closed.





[GitHub] [airflow] kaxil commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
kaxil commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799799077


   (the questions below are mostly queries not comments :) )
   
   I am just trying to understand the `__del__` method here. When will it be called? Only during gc or when killing the process?
   
   and since we aren't ending the `self.executor` in `__del__`, how does this PR replace https://github.com/apache/airflow/pull/14085
   
   





[GitHub] [airflow] potiuk edited a comment on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-800714372


   Hey @ashb @kaxil - I think this last iteration should be good to go. I struggled for a while with the count() method, and it turned out that this was actually the only one where I forgot to add the tearDown() method. I pushed the "parallel run" based on this here to verify: https://github.com/apache/airflow/pull/14531 and I have high hopes it will be all green this time. It was always failing in 1-2 jobs, so if it gets green it means that the problem is solved by the combination of the tearDown SchedulerJob ending and the "serialization update/fetch" interval set to 0.
   
   The good news is that the parallel tests are very "sensitive", and it will be much easier to spot problems if someone introduces similar cross-test side effects. I really hope we can finally merge it tomorrow.





[GitHub] [airflow] ashb commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799750032


   > For example changing counts of queries
   
   I think that is the only exception -- queries would only be counted from the _running_ process -- other processes (sub or zombie) won't be able to affect the query count.
   
   The rest of what you said sounds possible though.





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-800262758


   Hey @kaxil. There were some very tiny changes needed (maybe not strictly needed, but I think they make things more predictable during tests); can you take a look: https://github.com/apache/airflow/commit/49910097ec88fd4298bc5fe8c9e14d3cbc8f13eb
   
   This change is more "predictable" when it comes to potential time skew in case the DB time is slightly different from the Python time (this should not happen in tests, but it also avoids some extra checks).
   
   * The `min_update_interval` check was for `None`, which I believe was not possible to set via the `int` config. It also avoids an unnecessary query.
   * Similarly, the additional explicit `== 0` check in `min_serialized_dag_fetch_secs == 0`.
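A rough sketch of the short-circuit being discussed, with illustrative names (this is not the actual Airflow code): with the interval set to 0, the freshness lookup can be skipped entirely.

```python
import time

MIN_SERIALIZED_DAG_UPDATE_INTERVAL = 0  # seconds; 0 in the tests under discussion


def dag_needs_rewrite(stored_last_updated):
    """Decide whether the serialized DAG row should be re-written.

    With the interval at 0 we always re-write, and never need the extra
    query that would fetch the stored timestamp first.
    """
    if MIN_SERIALIZED_DAG_UPDATE_INTERVAL == 0:
        return True
    return (time.time() - stored_last_updated) >= MIN_SERIALIZED_DAG_UPDATE_INTERVAL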
   
   If you think those make sense, I will continue testing this evening and try to work out all the small things. I will also set the intervals to 0 in all relevant tests to make them predictable.





[GitHub] [airflow] potiuk edited a comment on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799835271


   When I run it locally (and change the expected number) I get exactly 10 calls for the serialized_dag, as expected.
   
   So somewhere, something executes one more serialized_dag get() when the tests are run in parallel. Any guess? Is there a possibility of re-parsing/re-retrieving the serialized_dag after some time? (We have max_run set to 1.) I doubt it is a side effect from other tests (however, that is possible as well):
   
   ```
   E           AssertionError: The expected number of db queries is 196. The current number is 195.
   E           
   E           Recorded query locations:
   E           	test_scheduler_job.py:test_execute_queries_count_with_harvested_dags>scheduler_job.py:_run_scheduler_loop>scheduler_job.py:adopt_or_reset_orphaned_tasks:1853:	1
   E           	test_scheduler_job.py:test_execute_queries_count_with_harvested_dags>scheduler_job.py:_run_scheduler_loop>scheduler_job.py:adopt_or_reset_orphaned_tasks:1885:	1
   E           	scheduler_job.py:_run_scheduler_loop>scheduler_job.py:adopt_or_reset_orphaned_tasks>taskinstance.py:__repr__:899:	100
   E           	test_scheduler_job.py:test_execute_queries_count_with_harvested_dags>scheduler_job.py:_run_scheduler_loop>scheduler_job.py:adopt_or_reset_orphaned_tasks:1912:	1
   E           	scheduler_job.py:_do_scheduling>retries.py:wrapped_function>scheduler_job.py:_create_dagruns_for_dags:1592:	1
   E           	retries.py:wrapped_function>scheduler_job.py:_create_dagruns_for_dags>scheduler_job.py:_create_dag_runs:1608:	1
   E           	dagbag.py:get_dag>dagbag.py:_add_dag_from_db>serialized_dag.py:get:229:	10
   E           	scheduler_job.py:_create_dagruns_for_dags>scheduler_job.py:_create_dag_runs>dag.py:create_dagrun:1781:	10
   E           	dag.py:create_dagrun>dagrun.py:verify_integrity>dagrun.py:get_task_instances:329:	10
   E           	scheduler_job.py:_create_dag_runs>dag.py:create_dagrun>dagrun.py:verify_integrity:674:	10
   E           	scheduler_job.py:_create_dagruns_for_dags>scheduler_job.py:_create_dag_runs>scheduler_job.py:_update_dag_next_dagruns:1659:	1
   E           	scheduler_job.py:_do_scheduling>retries.py:wrapped_function>scheduler_job.py:_create_dagruns_for_dags:1595:	1
   E           	test_scheduler_job.py:test_execute_queries_count_with_harvested_dags>scheduler_job.py:_run_scheduler_loop>scheduler_job.py:_do_scheduling:1533:	1
   E           	scheduler_job.py:_schedule_dag_run>scheduler_job.py:_verify_integrity_if_dag_changed>serialized_dag.py:get_latest_version_hash:285:	10
   E           	dagrun.py:update_state>dagrun.py:task_instance_scheduling_decisions>dagrun.py:get_task_instances:329:	10
   E           	scheduler_job.py:_do_scheduling>scheduler_job.py:_schedule_dag_run>dagrun.py:update_state:480:	10
   E           	scheduler_job.py:_do_scheduling>scheduler_job.py:_schedule_dag_run>dagrun.py:schedule_tis:761:	10
   E           	test_scheduler_job.py:test_execute_queries_count_with_harvested_dags>scheduler_job.py:_run_scheduler_loop>scheduler_job.py:_do_scheduling:1549:	1
   E           	scheduler_job.py:_critical_section_execute_task_instances>scheduler_job.py:_executable_task_instances_to_queued>pool.py:slots_stats:107:	1
   E           	scheduler_job.py:_critical_section_execute_task_instances>scheduler_job.py:_executable_task_instances_to_queued>pool.py:slots_stats:111:	1
   E           	scheduler_job.py:_do_scheduling>scheduler_job.py:_critical_section_execute_task_instances>scheduler_job.py:_executable_task_instances_to_queued:949:	1
   E           	scheduler_job.py:_critical_section_execute_task_instances>scheduler_job.py:_executable_task_instances_to_queued><string>:<lambda>:1:	1
   E           	scheduler_job.py:_critical_section_execute_task_instances>scheduler_job.py:_executable_task_instances_to_queued>scheduler_job.py:__get_concurrency_maps:895:	1
   E           	scheduler_job.py:_do_scheduling>scheduler_job.py:_critical_section_execute_task_instances>scheduler_job.py:_executable_task_instances_to_queued:1090:	1
   ```
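If such a re-retrieval exists, a fetch-interval gate in front of the serialized-DAG cache would explain the extra get(). A loose, self-contained sketch of such a gate - names are illustrative, not the real DagBag implementation:

```python
import time

MIN_SERIALIZED_DAG_FETCH_INTERVAL = 100  # seconds

fetch_count = 0
_cache = {}


class _CacheEntry:
    def __init__(self, dag, fetched_at):
        self.dag = dag
        self.fetched_at = fetched_at


def fetch_from_db(dag_id):
    # Each call here would be one serialized_dag get() query in the trace.
    global fetch_count
    fetch_count += 1
    return {"dag_id": dag_id}


def get_dag(dag_id, now=None):
    now = time.monotonic() if now is None else now
    entry = _cache.get(dag_id)
    if entry and now - entry.fetched_at < MIN_SERIALIZED_DAG_FETCH_INTERVAL:
        return entry.dag  # fresh enough: no extra query, stable query counts
    dag = fetch_from_db(dag_id)
    _cache[dag_id] = _CacheEntry(dag, now)
    return dag


get_dag("dag_a")  # first call hits the "DB"
get_dag("dag_a")  # second call is served from the cache
```

With a very small interval, a long-running parallel test pass can cross the threshold mid-test and trigger one extra fetch, which would shift the recorded query count by exactly one.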





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799826498


   Also (side comment) 
   
   This is likely a good candidate for optimisation:
   
   ```
   	scheduler_job.py:_run_scheduler_loop>scheduler_job.py:adopt_or_reset_orphaned_tasks>taskinstance.py:__repr__:899:	100
   ```
   
   I bet this can be grouped into a single query (but that is a different story).
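A sketch of the grouping idea against a stand-in sqlite table - 100 per-row lookups collapsed into one IN query (illustrative only; the real fix would go through SQLAlchemy):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (task_id TEXT, dag_id TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?, ?)",
    [(f"task_{i}", "dag", "running") for i in range(100)],
)

task_ids = [f"task_{i}" for i in range(100)]

# Instead of 100 round trips (one per task instance, as in the trace above),
# fetch everything in a single grouped query:
placeholders = ",".join("?" for _ in task_ids)
rows = conn.execute(
    f"SELECT task_id, state FROM task_instance WHERE task_id IN ({placeholders})",
    task_ids,
).fetchall()
```

One round trip instead of a hundred; the per-row cost then becomes local iteration rather than a query each.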





[GitHub] [airflow] potiuk edited a comment on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799257444


   Hey @ashb @kaxil -> #14531 created a good opportunity to test some race conditions for some of the flaky tests in the scheduler, and I would like to fix those flaky tests before I merge the parallel change.
   
   I extracted this change out of the 'parallel run change'; these were my attempts to make the tests non-flaky.
   
   The attempts helped - I managed to run several test passes successfully without the problems, both on parallel self-hosted instances and on Public Runners, but I am not at all sure whether:
   
   a) this is a good fix
   b) I am not masking some real problem that might appear under heavy load in a real Airflow production system
   
   Since you are the experts in this area, maybe you can try to answer those questions?
   
   I captured example flaky tests in #14778 #14773 #14772 (you can see stack traces there).
   
   I have several hypotheses why those flaky tests could appear. Maybe you can review/check/verify my changes and the hypotheses:
   
   * Hypothesis 1): There is a delay between the scheduler actually parsing the DAG and saving it into SerializedDagModel. Under heavy load this might get slightly delayed, and hence the row is not saved there yet. My changes assume that this hypothesis is valid.
   
   * Hypothesis 2): There is another background process, resulting from other tests, that cleans up the SerializedDagModel: while the previous tests are already finished (they are run sequentially within the Core test type), some background process is still running and removes the model. If that's the case, then this is the wrong fix. We need to find a way to get rid of the cross-test side effects and a way to kill all the background threads in the tearDown/fixture of the tests.
   
   * Hypothesis 3): There is a delay in the MySQL/Postgres DB between the actual commit/flush and the time when a subsequent transaction can read the data. This is, to be honest, the most interesting one, and if this hypothesis is true, my fix is not good either. We should find a way to be sure that the change is actually written to the database and available to other sessions, to make the tests stable.
   
   
   I would love to hear your opinion @kaxil @ash - any other hypotheses? What do you think about those hypotheses I have and the fix? 
   
   





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799704028


   I am now running the full suite of tests with #14531 a few times on my machine as well, to test flakiness - with the parallel runs it should take no more than 25 minutes to run it all 5 times.





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-801443712


   Hey @ashb @kaxil. I think I finally nailed it. The last time I ran it in parallel, everything succeeded. I improved the parallel run additionally and am running it now, but I would love to merge this now.
   
   Summarising the fixes:
   
   * we cannot use `__del__`, as SchedulerJob is an SQLAlchemy-managed object - and those two do not work well together
   
   * I updated all tests' setUp and tearDown to "end" SchedulerJob processes. That seems to help in most cases, except the one count test with 195 queries.
   
   * rather than setting the "min_update/min_fetch" intervals to 0, I set them to 100 in the count test. And that finally fixed the stability of that count test, I think. The problem was that the test, run together with others, ran long enough that there was an extra "get" of the serialized DAGs - one that would normally not happen; the "min_fetch_interval" was actually too low, it seems.
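The setUp/tearDown pattern described above could look roughly like this - the FakeSchedulerJob/FakeAgent stand-ins are purely illustrative, while the real suite ends the actual SchedulerJob and its processor agent:

```python
import unittest


class FakeAgent:
    """Stand-in for the DAG file processor agent."""

    def __init__(self):
        self.ended = False

    def end(self):
        self.ended = True


class FakeSchedulerJob:
    """Stand-in for airflow.jobs.scheduler_job.SchedulerJob."""

    def __init__(self):
        self.processor_agent = FakeAgent()


class SchedulerJobTestCase(unittest.TestCase):
    def setUp(self):
        self.scheduler_job = FakeSchedulerJob()

    def tearDown(self):
        # Make sure no scheduler process survives into the next test,
        # so later tests see no cross-test side effects.
        if self.scheduler_job and self.scheduler_job.processor_agent:
            self.scheduler_job.processor_agent.end()
        self.scheduler_job = None

    def test_dummy(self):
        self.assertIsNotNone(self.scheduler_job)


# Drive the lifecycle manually to show tearDown ending the agent:
tc = SchedulerJobTestCase("test_dummy")
tc.setUp()
agent = tc.scheduler_job.processor_agent
tc.tearDown()  # agent is ended here, before the next test starts
```

Ending the job in tearDown rather than relying on `__del__` avoids depending on garbage-collection timing, which the SQLAlchemy point above rules out anyway.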





[GitHub] [airflow] ashb commented on a change in pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#discussion_r594210751



##########
File path: tests/jobs/test_scheduler_job.py
##########
@@ -2638,6 +2638,7 @@ def test_scheduler_verify_pool_full_2_slots_per_task(self):
                 execution_date=date,
                 state=State.RUNNING,
             )
+            SerializedDagModel.write_dag(dag)

Review comment:
       Please move outside the loop to L2633

##########
File path: tests/jobs/test_scheduler_job.py
##########
@@ -2589,8 +2589,8 @@ def test_scheduler_verify_pool_full(self):
             execution_date=dag.following_schedule(dr.execution_date),
             state=State.RUNNING,
         )
+        SerializedDagModel.write_dag(dag)

Review comment:
       This should be above line 2586 I think

##########
File path: tests/jobs/test_scheduler_job.py
##########
@@ -2822,6 +2824,7 @@ def test_verify_integrity_if_dag_changed(self):
         assert scheduler.dagbag.dags == {'test_verify_integrity_if_dag_changed': dag}
         assert len(scheduler.dagbag.dags.get("test_verify_integrity_if_dag_changed").tasks) == 1
 
+        SerializedDagModel.write_dag(dag=dag)

Review comment:
       This one doesn't seem like it would be needed as two lines later we write it.







[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-800714372


   Hey @ashb @kaxil - I think this last iteration should be good to go. I struggled for a while with the count() method, and it turned out that this was actually the only one where I forgot to add the tearDown() method. I pushed the "parallel run" based on this here to verify: https://github.com/apache/airflow/pull/14531 and I have high hopes it will be all green this time. It was always failing in 1-2 jobs, so if it gets green it means that the problem is solved by the combination of the tearDown SchedulerJob ending and the "serialization update/fetch" interval set to 0.
   
   The good news is that the parallel tests are very "sensitive", and it will be much easier to spot problems if someone introduces similar cross-test side effects. I really hope we can finally merge it tomorrow.





[GitHub] [airflow] ashb commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799759936


   It should be there anyway -- see also https://github.com/apache/airflow/pull/14085
   
   The SchedulerJob instance is always in scope right until the `airflow scheduler` process is about to exit anyway.





[GitHub] [airflow] potiuk merged pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
potiuk merged pull request #14792:
URL: https://github.com/apache/airflow/pull/14792


   





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799767732


   Btw, if we add `__del__` I believe #14805 is not needed and should not be merged? Do I read it right?





[GitHub] [airflow] potiuk edited a comment on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799257444


   Hey @ashb @kaxil -> #14531 created a good opportunity to test some race conditions for some of the flaky tests in the scheduler, and I would like to fix those flaky tests before I merge the parallel change.
   
   I extracted this change out of the 'parallel run change'; these were my attempts to make the tests non-flaky.
   
   The attempts helped - I managed to run several test passes successfully without the problems, both on parallel self-hosted instances and on Public Runners, but I am not at all sure whether:
   
   a) this is a good fix
   b) I am not masking some real problem that might appear under heavy load in a real Airflow production system
   
   Since you are the experts in this area, maybe you can try to answer those questions?
   
   I captured example flaky tests in #14778 #14773 #14772 (you can see stack traces there).
   
   I have several hypotheses why those flaky tests could appear. Maybe you can review/check/verify my changes and the hypotheses:
   
   * Hypothesis 1): There is a delay between the scheduler actually parsing the DAG and saving it into SerializedDagModel. Under heavy load this might get slightly delayed, and hence the row is not saved there yet. My changes assume that this hypothesis is valid - but maybe the fix could be done better (i.e. by waiting for the scheduler to complete).
   
   * Hypothesis 2): There is another background process, resulting from other tests, that cleans up the SerializedDagModel: while the previous tests are already finished (they are run sequentially within the Core test type), some background process is still running and removes the model. If that's the case, then this is the wrong fix. We need to find a way to get rid of the cross-test side effects and a way to kill all the background threads in the tearDown/fixture of the tests.
   
   * Hypothesis 3): There is a delay in the MySQL/Postgres DB between the actual commit/flush and the time when a subsequent transaction can read the data. This is, to be honest, the most interesting one, and if this hypothesis is true, my fix is not good either. We should find a way to be sure that the change is actually written to the database and available to other sessions, to make the tests stable.
   
   
   I would love to hear your opinion @kaxil @ash - any other hypotheses? What do you think about those hypotheses I have and the fix? 
   
   I am happy to test it in my parallel-test change if you have other proposals. Those problems appear often enough to be reproducible there and I would love to wait with merging the parallel change until those flaky scheduler tests are fixed. 
   
   I am also happy to follow a similar step-by-step process for the other tests in the scheduler (there are a few more marked as quarantined), but we can treat them one by one and fix all of them before I merge the parallel tests change.
   





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799257444


   Hey @ashb @kaxil -> #14531 created a good opportunity to test some race conditions for some of the flaky tests in the scheduler, and I would like to fix those flaky tests before I merge the parallel change.
   
   I extracted this change out of the 'parallel run change'; these were my attempts to make the tests non-flaky.
   
   The attempts helped - I managed to run several test passes successfully without the problems, both on parallel self-hosted instances and on Public Runners, but I am not at all sure whether:
   
   a) this is a good fix
   b) I am not masking some real problem that might appear under heavy load in a real Airflow production system
   
   Since you are the experts in this area, maybe you can try to answer those questions?
   
   I captured example flaky tests in #14778 #14773 #14772 (you can see stack traces there).
   
   I have several hypotheses why those flaky tests could appear. Maybe you can review/check/verify my changes and the hypotheses:
   
   * Hypothesis 1): There is a delay between the scheduler actually parsing the DAG and saving it into SerializedDagModel. Under heavy load this might get slightly delayed, and hence the row is not saved there yet. My changes assume that this hypothesis is valid.
   
   * Hypothesis 2): There is another background process, resulting from other tests, that cleans up the SerializedDagModel: while the previous tests are already finished (they are run sequentially within the Core test type), some background process is still running and removes the model. If that's the case, then this is the wrong fix. We need to find a way to get rid of the cross-test side effects and a way to kill all the background threads in the tearDown/fixture of the tests.
   
   * Hypothesis 3): There is a delay in the MySQL/Postgres DB between the actual commit/flush and the time when a subsequent transaction can read the data. This is, to be honest, the most interesting one, and if this hypothesis is true, my fix is not good either. We should find a way to be sure that the change is actually written to the database and available to other sessions, to make the tests stable.
   
   
   I would love to hear your opinion @kaxil @ash - any other hypotheses? What do you think about those hypotheses I have and the fix? 
   
   





[GitHub] [airflow] potiuk edited a comment on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799849287


   Unfortunately, with the `__del__` solution we again have flakiness when run in parallel. There must be another reason for this. I am now reverting back to explicit scheduler stopping and will see if I can reproduce it.
   
   https://github.com/apache/airflow/pull/14531/checks?check_run_id=2117086265#step:6:10807
   
   ```
     _______________ TestSchedulerJob.test_scheduler_verify_pool_full _______________
     
     self = <tests.jobs.test_scheduler_job.TestSchedulerJob testMethod=test_scheduler_verify_pool_full>
     
         def test_scheduler_verify_pool_full(self):
             """
             Test task instances not queued when pool is full
             """
             dag = DAG(dag_id='test_scheduler_verify_pool_full', start_date=DEFAULT_DATE)
         
             BashOperator(
                 task_id='dummy',
                 dag=dag,
                 owner='airflow',
                 pool='test_scheduler_verify_pool_full',
                 bash_command='echo hi',
             )
         
             dagbag = DagBag(
                 dag_folder=os.path.join(settings.DAGS_FOLDER, "no_dags.py"),
                 include_examples=False,
                 read_dags_from_db=True,
             )
             dagbag.bag_dag(dag=dag, root_dag=dag)
             dagbag.sync_to_db()
         
             session = settings.Session()
             pool = Pool(pool='test_scheduler_verify_pool_full', slots=1)
             session.add(pool)
             session.flush()
         
             dag = SerializedDAG.from_dict(SerializedDAG.to_dict(dag))
             SerializedDagModel.write_dag(dag)
         
             scheduler = SchedulerJob(executor=self.null_exec)
             scheduler.processor_agent = mock.MagicMock()
         
             # Create 2 dagruns, which will create 2 task instances.
             dr = dag.create_dagrun(
                 run_type=DagRunType.SCHEDULED,
                 execution_date=DEFAULT_DATE,
                 state=State.RUNNING,
             )
     >       scheduler._schedule_dag_run(dr, {}, session)
     
     tests/jobs/test_scheduler_job.py:2585: 
     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
     airflow/jobs/scheduler_job.py:1707: in _schedule_dag_run
         dag = dag_run.dag = self.dagbag.get_dag(dag_run.dag_id, session=session)
     airflow/utils/session.py:62: in wrapper
         return func(*args, **kwargs)
     airflow/models/dagbag.py:178: in get_dag
         self._add_dag_from_db(dag_id=dag_id, session=session)
     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
     
     self = <airflow.models.dagbag.DagBag object at 0x7f1cf7b23760>
     dag_id = 'test_scheduler_verify_pool_full'
     session = <sqlalchemy.orm.session.Session object at 0x7f1cf781e310>
     
         def _add_dag_from_db(self, dag_id: str, session: Session):
             """Add DAG to DagBag from DB"""
             from airflow.models.serialized_dag import SerializedDagModel
         
             row = SerializedDagModel.get(dag_id, session)
             if not row:
     >           raise SerializedDagNotFound(f"DAG '{dag_id}' not found in serialized_dag table")
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799800650


   > (the questions below are mostly queries not comments :) )
   > 
   > I am just trying to understand the `__del__` method here. When will it be called? Only during gc or when killing the process?
   
   It's called whenever Python detects the object is unreachable. Python has a "hybrid" approach to garbage collection: primarily it uses reference counting, so (assuming the object is not part of a reference cycle) `__del__` will be called as soon as the SchedulerJob goes out of scope (or when the variable is re-assigned in tests).
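   The reference-counting behaviour described above can be demonstrated with a small standalone snippet (`SchedulerLike` is a made-up stand-in, not Airflow's `SchedulerJob`):

```python
import gc

calls = []

class SchedulerLike:
    """Illustrative stand-in for SchedulerJob (not the Airflow class)."""
    def __del__(self):
        calls.append("deleted")

# Reference counting: __del__ fires as soon as the last reference goes away,
# e.g. when a test re-assigns the variable holding the job.
job = SchedulerLike()
job = SchedulerLike()   # first instance is now unreachable -> __del__ runs
assert calls == ["deleted"]

# A reference cycle defeats pure reference counting; only the cyclic
# collector will eventually reclaim the object and call __del__.
a = SchedulerLike()
a.self_ref = a          # cycle: the object references itself
del a                   # refcount never reaches zero on its own
gc.collect()            # cycle detector finds it and finalizes it (PEP 442)
assert calls == ["deleted", "deleted"]
```

   Note the second half is exactly the caveat above: if a SchedulerJob ends up in a reference cycle, `__del__` is deferred until a gc pass rather than running at the point the test drops the variable.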
   
   
   
   > 
   > and since we aren't ending the `self.executor` in `__del__`, how does this PR replace #14085
   
   If you add ending the executor to `__del__`, then yes.
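   A minimal sketch of what that would look like (hypothetical stand-in classes, not the real Airflow implementation):

```python
class NullExecutor:
    """Hypothetical stand-in for an Airflow executor."""
    def __init__(self):
        self.ended = False

    def end(self):
        self.ended = True

class SchedulerJobSketch:
    """Sketch: end the executor when the job object is finalized."""
    def __init__(self, executor):
        self.executor = executor

    def __del__(self):
        # Relies on reference counting calling __del__ promptly; a
        # reference cycle would delay this until a gc pass.
        self.executor.end()

executor = NullExecutor()
job = SchedulerJobSketch(executor)
del job                 # last reference dropped -> __del__ ends the executor
assert executor.ended
```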





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-799849287


   Unfortunately, with the `__del__` solution we again have flakiness when running in parallel. There must be another reason for this. I am reverting to explicit scheduler stopping now and will see if I can reproduce it:
   
   
   https://github.com/apache/airflow/pull/14531/checks?check_run_id=2117086265#step:6:10807
   
   ```
     _______________ TestSchedulerJob.test_scheduler_verify_pool_full _______________
     
     self = <tests.jobs.test_scheduler_job.TestSchedulerJob testMethod=test_scheduler_verify_pool_full>
     
         def test_scheduler_verify_pool_full(self):
             """
             Test task instances not queued when pool is full
             """
             dag = DAG(dag_id='test_scheduler_verify_pool_full', start_date=DEFAULT_DATE)
         
             BashOperator(
                 task_id='dummy',
                 dag=dag,
                 owner='airflow',
                 pool='test_scheduler_verify_pool_full',
                 bash_command='echo hi',
             )
         
             dagbag = DagBag(
                 dag_folder=os.path.join(settings.DAGS_FOLDER, "no_dags.py"),
                 include_examples=False,
                 read_dags_from_db=True,
             )
             dagbag.bag_dag(dag=dag, root_dag=dag)
             dagbag.sync_to_db()
         
             session = settings.Session()
             pool = Pool(pool='test_scheduler_verify_pool_full', slots=1)
             session.add(pool)
             session.flush()
         
             dag = SerializedDAG.from_dict(SerializedDAG.to_dict(dag))
             SerializedDagModel.write_dag(dag)
         
             scheduler = SchedulerJob(executor=self.null_exec)
             scheduler.processor_agent = mock.MagicMock()
         
             # Create 2 dagruns, which will create 2 task instances.
             dr = dag.create_dagrun(
                 run_type=DagRunType.SCHEDULED,
                 execution_date=DEFAULT_DATE,
                 state=State.RUNNING,
             )
     >       scheduler._schedule_dag_run(dr, {}, session)
     
     tests/jobs/test_scheduler_job.py:2585: 
     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
     airflow/jobs/scheduler_job.py:1707: in _schedule_dag_run
         dag = dag_run.dag = self.dagbag.get_dag(dag_run.dag_id, session=session)
     airflow/utils/session.py:62: in wrapper
         return func(*args, **kwargs)
     airflow/models/dagbag.py:178: in get_dag
         self._add_dag_from_db(dag_id=dag_id, session=session)
     _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
     
     self = <airflow.models.dagbag.DagBag object at 0x7f1cf7b23760>
     dag_id = 'test_scheduler_verify_pool_full'
     session = <sqlalchemy.orm.session.Session object at 0x7f1cf781e310>
     
         def _add_dag_from_db(self, dag_id: str, session: Session):
             """Add DAG to DagBag from DB"""
             from airflow.models.serialized_dag import SerializedDagModel
         
             row = SerializedDagModel.get(dag_id, session)
             if not row:
     >           raise SerializedDagNotFound(f"DAG '{dag_id}' not found in serialized_dag table")
   ```
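   The failure mode in the traceback can be reproduced in miniature with plain sqlite3: the reader raises when the serialized row is missing, and the fix is to write the row before anything reads it (the table layout and helper below are illustrative only, not Airflow's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, data TEXT)")

class SerializedDagNotFound(Exception):
    pass

def get_dag(dag_id):
    """Mimics DagBag._add_dag_from_db: fail loudly if the row is absent."""
    row = conn.execute(
        "SELECT data FROM serialized_dag WHERE dag_id = ?", (dag_id,)
    ).fetchone()
    if not row:
        raise SerializedDagNotFound(f"DAG '{dag_id}' not found in serialized_dag table")
    return row[0]

# Reading before the serialized DAG is written reproduces the flaky failure:
try:
    get_dag("test_scheduler_verify_pool_full")
    raise AssertionError("expected SerializedDagNotFound")
except SerializedDagNotFound:
    pass

# The fix applied in the tests: write the serialized DAG first, then read.
conn.execute(
    "INSERT INTO serialized_dag VALUES (?, ?)",
    ("test_scheduler_verify_pool_full", "{}"),
)
assert get_dag("test_scheduler_verify_pool_full") == "{}"
```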
   





[GitHub] [airflow] kaxil commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
kaxil commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-800225965


   > Hey @kaxil @ashb -> I am still chasing the remaining "count" problems and while I have not solved them all, I think most of them come from the serialized_dag missing in the SerializedDag table.
   > 
   > Is there a way we can make sure the serialized DAG will be written to the database immediately?
   > 
   > I have a suspicion it is around min_update/fetch interval and I am trying this one:
   > 
   > [4991009](https://github.com/apache/airflow/commit/49910097ec88fd4298bc5fe8c9e14d3cbc8f13eb)
   > 
   > But I am not sure if this is the right fix.
   > 
   > More info/good example here:
   > 
   > https://github.com/apache/airflow/runs/2117788392?check_suite_focus=true#step:6:9982
   > 
   > ```
   >   E           	scheduler_job.py:_do_scheduling>retries.py:wrapped_function>scheduler_job.py:_create_dagruns_for_dags:1573:	1
   >   E           	retries.py:wrapped_function>scheduler_job.py:_create_dagruns_for_dags>scheduler_job.py:_create_dag_runs:1592:	1
   >   E           	dagbag.py:get_dag>dagbag.py:_add_dag_from_db>serialized_dag.py:get:229:	3
   >   E           	dagbag.py:get_dag>dagbag.py:_add_dag_from_db>serialized_dag.py:get:235:	2
   >   E           	dagbag.py:get_dag>dagbag.py:_add_dag_from_db>serialized_dag.py:get:237:	2
   > ```
   > 
   > The last two lines never appear when you run the tests locally.
   > 
   > Lines of code:
   > 
   > ```
   > 229  row = session.query(cls).filter(cls.dag_id == dag_id).one_or_none()
   >         if row:
   >             return row
   > 
   >         # If we didn't find a matching DAG id then ask the DAG table to find
   >         # out the root dag
   > 235  root_dag_id = session.query(DagModel.root_dag_id).filter(DagModel.dag_id == dag_id).scalar()
   > 
   > 237  return session.query(cls).filter(cls.dag_id == root_dag_id).one_or_none()
   > ```
   
   Setting `min_serialized_dag_update_interval` to `0` should do it -- shouldn't need any code change for that -- only tests (I think)
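   One way to apply that in a test without any code change is an environment-variable override, since Airflow resolves config from `AIRFLOW__<SECTION>__<KEY>` variables ahead of `airflow.cfg`. The snippet below only demonstrates the override mechanics, not a full Airflow test:

```python
import os
from unittest import mock

# Setting the interval to 0 makes every SerializedDagModel.write_dag() call
# hit the database instead of being skipped as "too recently updated".
override = {"AIRFLOW__CORE__MIN_SERIALIZED_DAG_UPDATE_INTERVAL": "0"}

with mock.patch.dict(os.environ, override):
    # ... run the flaky test body here; config reads see the 0 interval ...
    assert os.environ["AIRFLOW__CORE__MIN_SERIALIZED_DAG_UPDATE_INTERVAL"] == "0"

# mock.patch.dict restores the environment when the context exits.
assert "AIRFLOW__CORE__MIN_SERIALIZED_DAG_UPDATE_INTERVAL" not in os.environ
```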





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-800715451


   I also slightly improved the diagnostics for the "failed count" cases - we will now see line numbers from all stack traces when the count differs, so it will be easy to spot cases where one of the top methods runs some `if` statement and we get an execution from another line of code.
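   A rough sketch of that kind of diagnostic (a hypothetical helper, not the actual code in the PR): record the last few stack frames as a `file:func` chain ending in a line number, and count occurrences, so a mismatched count points straight at the divergent branch.

```python
import os
import traceback
from collections import Counter

call_sites = Counter()

def record_call_site(depth=3):
    """Count executions keyed by the last `depth` callers, rendered as
    file.py:func>file.py:func:lineno (similar to the output shown above)."""
    frames = traceback.extract_stack()[:-1][-depth:]   # drop this frame
    parts = [f"{os.path.basename(f.filename)}:{f.name}" for f in frames]
    key = ">".join(parts) + f":{frames[-1].lineno}"
    call_sites[key] += 1

def inner():
    record_call_site()

def outer():
    inner()

outer()
outer()
assert sum(call_sites.values()) == 2
assert len(call_sites) == 1   # both runs came through the same branch
```

   When the counts disagree between two runs, the extra keys show exactly which call chain (and which line) produced the unexpected executions.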





[GitHub] [airflow] potiuk commented on pull request #14792: Fixes some of the flaky tests in test_scheduler_job

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14792:
URL: https://github.com/apache/airflow/pull/14792#issuecomment-800112254


   Hey @kaxil @ashb -> I am still chasing the remaining "count" problems, and while I have not solved them all, I think most of them come from the serialized DAG missing from the SerializedDag table.
   
   Is there a way we can make sure the serialized DAG will be written to the database immediately?
   
   I have a suspicion it is around min_update/fetch interval  and I am trying this one:
   
   https://github.com/apache/airflow/pull/14792/commits/49910097ec88fd4298bc5fe8c9e14d3cbc8f13eb
   
   But I am not sure if this is the right fix.
   
   More info/good example here:
   
   https://github.com/apache/airflow/runs/2117788392?check_suite_focus=true#step:6:9982 
   
   ```
     E           	scheduler_job.py:_do_scheduling>retries.py:wrapped_function>scheduler_job.py:_create_dagruns_for_dags:1573:	1
     E           	retries.py:wrapped_function>scheduler_job.py:_create_dagruns_for_dags>scheduler_job.py:_create_dag_runs:1592:	1
     E           	dagbag.py:get_dag>dagbag.py:_add_dag_from_db>serialized_dag.py:get:229:	3
     E           	dagbag.py:get_dag>dagbag.py:_add_dag_from_db>serialized_dag.py:get:235:	2
     E           	dagbag.py:get_dag>dagbag.py:_add_dag_from_db>serialized_dag.py:get:237:	2
   ```
   
   The last two lines never appear when you run the tests locally.
   
   Lines of code:
   
   ```
   229  row = session.query(cls).filter(cls.dag_id == dag_id).one_or_none()
           if row:
               return row
   
           # If we didn't find a matching DAG id then ask the DAG table to find
           # out the root dag
   235  root_dag_id = session.query(DagModel.root_dag_id).filter(DagModel.dag_id == dag_id).scalar()
   
   237  return session.query(cls).filter(cls.dag_id == root_dag_id).one_or_none()
   ```
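   For context, the fallback on lines 235/237 handles subdags: when the requested dag_id has no serialized row of its own, the DAG table maps it to its root dag, whose serialized blob is fetched instead. A miniature sqlite version of that lookup (illustrative schema and data only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE serialized_dag (dag_id TEXT PRIMARY KEY, data TEXT);
    CREATE TABLE dag (dag_id TEXT PRIMARY KEY, root_dag_id TEXT);
    -- only the root DAG is serialized; the subdag maps to it
    INSERT INTO serialized_dag VALUES ('parent', '{"contains": "subdags"}');
    INSERT INTO dag VALUES ('parent.child', 'parent');
    """
)

def get_serialized(dag_id):
    row = conn.execute(                      # "line 229" path: direct hit
        "SELECT data FROM serialized_dag WHERE dag_id = ?", (dag_id,)
    ).fetchone()
    if row:
        return row[0]
    root = conn.execute(                     # "line 235" path: find root dag
        "SELECT root_dag_id FROM dag WHERE dag_id = ?", (dag_id,)
    ).fetchone()
    root_dag_id = root[0] if root else None
    row = conn.execute(                      # "line 237" path: fetch by root
        "SELECT data FROM serialized_dag WHERE dag_id = ?", (root_dag_id,)
    ).fetchone()
    return row[0] if row else None

assert get_serialized("parent.child") == '{"contains": "subdags"}'
```

   So hits on lines 235/237 in the counts above mean the direct lookup missed and the code fell through to the root-dag branch, which is consistent with the serialized row not yet being written.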
   
   
   
   

