Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/02/28 17:25:33 UTC

[GitHub] [airflow] potiuk opened a new pull request #14531: Running tests in parallel for self-hosted runners

potiuk opened a new pull request #14531:
URL: https://github.com/apache/airflow/pull/14531


   This is by far the biggest improvement of the test execution time
   we can get now that we are using self-hosted runners.

   This change drives down the time of executing all tests on
   self-hosted runners from >60 minutes to <15 minutes, thanks to the
   heavy parallelisation we can apply to the different test types and
   the fact that our self-hosted runner machines are far more
   capable - they have more CPUs, more memory, and we use tmpfs for
   everything.
   
   This change will also drive the cost of our self-hosted runners
   down. Since we have auto-scaling infrastructure, we will simply need
   the machines to run tests for a far shorter time. Since the number
   of test jobs we run on those self-hosted runners is substantial
   (10 jobs), we are going to save ~6 build hours per PR/merged
   commit!
   
   This also allows developers to use the power of their
   development machines - if they have a bigger number of CPUs,
   they can run the full Airflow test suite in a few minutes
   by setting the `RUN_TESTS_IN_PARALLEL` variable to `true`.

   On one personal PC (64 GB RAM, 8 CPUs/16 cores, fast SSD) the full
   test suite execution went down from 45 minutes to 8 minutes.
   
   Another benefit of this change is that the output from tests is only
   shown in full in CI when there is an error. Continuous
   progress information is printed every 10 seconds whether
   parallel or sequential tests are run, but the full output
   is only shown for test types which failed - at the end of the
   output, clearly marked and grouped in "Red" groups, so that
   it is much easier to see what the root causes of
   the test failures are. This works by default for both sequential
   and parallel test execution, but for sequential execution it
   can be changed back to the previous mode, where output is always
   fully printed during the execution, by setting the variable
   `TEST_SHOW_ALL_OUTPUT_LINES_AS_THEY_APPEAR` to `true`.
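   
   For illustration, a developer could use these variables locally roughly
   like this (the entrypoint script shown is only an example - use whatever
   Breeze/CI script you normally run the test suite with):
   
   ```bash
   # Run as many test types in parallel as there are CPUs available
   export RUN_TESTS_IN_PARALLEL="true"
   
   # For sequential runs only: stream all output lines as they appear
   # instead of the summarized, grouped output
   # export TEST_SHOW_ALL_OUTPUT_LINES_AS_THEY_APPEAR="true"
   
   # Illustrative entrypoint
   ./scripts/ci/testing/ci_run_airflow_testing.sh
   ```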
   
   <!--
   Thank you for contributing! Please make sure that your code changes
   are covered with tests. And in case of new features or big changes
   remember to adjust the documentation.
   
   Feel free to ping committers for the review!
   
   In case of existing issue, reference it using one of the following:
   
   closes: #ISSUE
   related: #ISSUE
   
   How to write a good git commit message:
   http://chris.beams.io/posts/git-commit/
   -->
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines)** for more information.
   In case of fundamental code change, Airflow Improvement Proposal ([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)) is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in [UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#discussion_r586587023



##########
File path: .github/workflows/ci.yml
##########
@@ -35,6 +35,8 @@ env:
   DB_RESET: "true"
   VERBOSE: "true"
   DOCKER_CACHE: "pulled"
+  # Do not show all output lines on CI. Only output of the failed tests will be shown as a summary
+  TEST_SHOW_ALL_OUTPUT_LINES_AS_THEY_APPEAR: "false"

Review comment:
       I thought about it and was not sure either. I simply find it rather unnecessary to print all the output for sequential tests. They usually take a lot of log lines. But yeah, we can keep it. This output is grouped per "test type" so it will be folded anyway, and I agree that it might be useful at times.







[GitHub] [airflow] potiuk edited a comment on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-789386354


   > Have you looked at pytest-xdist?
   
   As I already explained several times - yes, different people have looked at this several times in the past. The first time was 2 years ago, and we got scared by the results. The result was always the same - it failed big time. You can try it yourself if you want.
   
   Our tests (a big number of them) rely on a shared database which is reset, reinitialized, filled with data and cleaned by multiple tests. The shared database is a shared-state resource, and if we try to run tests in parallel using xdist, they override each other's data and fail completely randomly. There is no way to run parallel tests sharing the same database (unless we completely redefine our tests and always mock the database or somehow isolate all the tests that use the DB).
   
   In particular, quite a lot of tests from the core (including the scheduler ones - I guess you are familiar with those) actually use the DB and are not ready to be run in parallel. I think you can imagine best what starts happening if you run those multiple scheduler tests against the same database in parallel. But those are not the only ones. The Airflow test suite encourages people to use the DB during their testing.
   
   There are also a number of tests that rely on side effects from other tests (stored in the same database).  
   
   My solution solves it by giving every test type its own database (and within each "type group" the tests are run sequentially). The groups are then run in parallel. This way they cannot override each other's data, and since I already run each group separately in the 'sequential' approach, I know that the side effects are at least 'under control' (i.e. they do not show up).
   
   Of course, if we lived in an ideal world we would have no side effects and no shared database, in which case pytest-xdist would work for us. But, unfortunately, our small Airflow world is not perfect, and fixing it to be perfect could easily take weeks or months of work isolating, fixing and mocking out all the tests. It's a great regret of mine, but I believe no one in the community has time for it, so I prefer to implement what gives an immediate effect and can be implemented with very small effort.
   
   But I do encourage you to take on the task and fix all the tests. It would be perfect to have that at some point, and it would likely drive the time needed to run the tests down even more.
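   
   To illustrate the approach described above, here is a minimal sketch of the
   per-test-type isolation (the test type names, the helper script and the
   compose-project trick are illustrative, not the exact PR implementation):
   
   ```bash
   #!/usr/bin/env bash
   # Each test "type" gets its own database, so the groups cannot clobber each
   # other's data; tests inside a group still run sequentially.
   set -euo pipefail
   
   TEST_TYPES=("Core" "Providers" "API" "CLI" "WWW")
   
   run_test_type() {
       local test_type="$1"
       # A separate compose project name gives every group its own backend DB
       # container (hypothetical script name below).
       COMPOSE_PROJECT_NAME="airflow-test-${test_type,,}" \
           ./scripts/ci/run_single_test_type.sh "${test_type}"
   }
   export -f run_test_type
   
   # GNU parallel runs one job per test type, capped at the number of CPUs.
   parallel --jobs "$(nproc)" run_test_type ::: "${TEST_TYPES[@]}"
   ```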





[GitHub] [airflow] ashb commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-792652077


   k8s and backfill I have noticed occasionally too. It's _better_ on self-hosted runners but not gone entirely.





[GitHub] [airflow] potiuk commented on a change in pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#discussion_r591415464



##########
File path: tests/www/test_views.py
##########
@@ -1154,6 +1154,11 @@ def test_page_instance_name_xss_prevention(self):
 
 
 class TestConfigurationView(TestBase):
+    def setUp(self):
+        super().setUp()
+        with mock.patch.dict(os.environ, {"AIRFLOW__CORE__UNIT_TEST_MODE": "False"}):
+            initialize_config()

Review comment:
       Sure. This test reads the value of the airflow configuration and returns it to the caller. The test failed (when I ran it in parallel) because the config was missing. The web API returns the "production" config - no matter whether UNIT_TEST_MODE is set to True or not. I have not yet had time to figure out why "airflow.cfg" is missing when I run the test in parallel mode (it is there when we run the tests sequentially).
   
   I still have to figure it out (and planned to). My guess is that some factor influences the sequence of executing the tests, and some other test simply deletes the configuration as part of its execution, so the config never gets restored when `test_configuration_expose_config` is executed. Unfortunately, the presence of the config is an 'implicit dependency' in many of our tests - the problem is that the config gets automatically created by the airflow configuration code when it is missing, so if any test depends on its presence, it might go unnoticed that the config is actually a side effect of either other tests or simply of running the airflow cli.
   
   So I believe my fix is the `proper` fix (but there are probably other tests with a similar side-effect dependency). Since this test depends on `airflow.cfg` being present, this should not be a side effect - it should be explicitly initialised in the `setUp` of the test. Which is precisely what I do - I make sure that "unit_test_mode" is set to false, so that the "production" and not the "unittest" config is initialized, and then I initialise it.
   
   I think this is how the test should have been written in the first place.
   
   







[GitHub] [airflow] potiuk commented on a change in pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#discussion_r586059754



##########
File path: tests/www/test_views.py
##########
@@ -1154,6 +1154,11 @@ def test_page_instance_name_xss_prevention(self):
 
 
 class TestConfigurationView(TestBase):
+    def setUp(self):
+        super().setUp()
+        with mock.patch.dict(os.environ, {"AIRFLOW__CORE__UNIT_TEST_MODE": "False"}):
+            initialize_config()

Review comment:
       Yes. Otherwise one of the WWW tests fails when WWW is run in full isolation. This test is written with the assumption that the config has been created, but if it is run in isolation the config does not exist. This is a fix for those tests - making sure that they do not rely on side effects from other tests (this configuration file is written by other tests, which is why it has never been caught before).







[GitHub] [airflow] ashb commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-789691674


   > > If yes, then the Launch Template in eu-central-1 defines the instance size, and the user-data defines how big we mount the /var/lib/docker -- cloud-init.yaml lives in the airflow-ci-infra repo and (for now) needs to be manually updated in AWS on change. Command is in the commit that added it.
   > 
   > Do I need to do anything - like restart the machines etc. to be sure that new changes are in effect?
   
   Restarting the machines would kill in-progress jobs. If you just leave it, then over time, as they become idle and get replaced, new ones will come up. So no - just making the change and waiting is the best plan.
   
   I recommend adding a "LaunchTemplate version" column to the EC2 Instances dashboard; alternatively, I have this snippet I run to see what instances are doing.
   
   ```bash
   watch "~/.virtualenvs/awscli/bin/aws --profile airflow autoscaling describe-auto-scaling-groups | jq '.AutoScalingGroups[0].Instances[] | {state:.LifecycleState, protected: .ProtectedFromScaleIn, ver: .LaunchTemplate.Version, id:.InstanceId}' -c | sort -r"
   ```
   
   (`protected:true` means that instance is currently executing a job).





[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-798906062


   OK, looks like I got it under control. It will fail occasionally with 137, but in a similar way as in the previous setup.
   
   Summary: It seems that when we are done, the whole test suite will be able to complete in ~13 minutes rather than the current ~53 minutes, and when you have a powerful local machine, you will be able to run the whole test suite in less than 6 minutes.
   
   
   What I have done now is add a bit more `smarts`:
   
   a) I check how much memory and how many CPUs are available not on the host but in the docker engine. Those might differ - especially on Mac.
   
   b) I run most tests in parallel (as many parallel test runs as we have CPUs), but if we have less than ~32 GB of RAM available in the Docker Engine, I do not run Integration tests in parallel with the other tests - they are run separately, at the end, after cleaning up the docker engine remnants.
   
   c) If we have > 32 GB of memory in the docker engine, we run everything in parallel (a sketch of this detection is shown below).
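   
   For illustration, detecting the resources available to the Docker Engine could look roughly like this (the threshold handling and variable names here are mine, not the exact PR code):
   
   ```bash
   # Memory/CPUs as seen by the Docker Engine, not the host (differs on Mac)
   MEMORY_AVAILABLE_BYTES=$(docker info --format '{{.MemTotal}}')
   CPUS_AVAILABLE=$(docker info --format '{{.NCPU}}')
   
   if (( MEMORY_AVAILABLE_BYTES < 32 * 1024 * 1024 * 1024 )); then
       # Not enough memory: run Integration tests sequentially, after the rest
       RUN_INTEGRATION_TESTS_SEPARATELY="true"
   else
       RUN_INTEGRATION_TESTS_SEPARATELY="false"
   fi
   
   # One parallel test stream per CPU, unless explicitly overridden
   MAX_PARALLEL_TEST_JOBS="${MAX_PARALLEL_TEST_JOBS:=${CPUS_AVAILABLE}}"
   ```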
   
   The current numbers (they have already improved after the fantastic @jhtimmins optimisations in the WWW/API tests):
   
   ## Public GitHub runners:
   
   * Before parallelization: ~53 minutes
   * After parallelization (2 parallel test streams + Integration tests sequentially): ~34 minutes
   
   **Overall**: We save 20 minutes per test suite - a ~40% (!) speedup (for the whole test suite)
   
   ## Self-hosted runners with 4 cores/32 GB RAM + tmpfs (current setup)
   
   * Before parallelization: ~42 minutes
   * After parallelization (4 parallel test streams + Integration tests sequentially): ~24 minutes
   
   **Overall**: We save 18 minutes per test suite - a ~43% (!) speedup (for the whole test suite)
   
   ## Future self-hosted runners with 8 cores/64 GB RAM + tmpfs (estimates based on earlier measurements on AWS, where I temporarily switched our templates to use them)
   
   * Before parallelization: ~40 minutes
   * After parallelization (8 parallel test streams): ~13 minutes
   
   **Overall**: We save 12 minutes vs the 4-core/32 GB machines, which is an ADDITIONAL almost ~50% (!) improvement (for the whole test suite)
   
   I think this is a really nice auto-detecting setup which will not only help with CI but will also allow developers who have powerful machines to run the full test suite locally. The full test suite on my machine - 8 CPUs (16 cores) + 64 GB RAM + no tmpfs (but a very fast SSD disk) - takes exactly 5 (!!!!) minutes.
   
   So we can still improve it. 
   





[GitHub] [airflow] ashb commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-795370455


   > I almost got it running with ~45-minute tests in the GitHub Public runners. There are some hangs which I have to look at - likely due to memory usage. But maybe we can limit that further and get it a bit faster
   
   Sounds great -- no 137 failures cropping up so far?
   
   > I want to have a predictable sequence of the test types. Right now the execution length of the tests varies a lot because the parallel semaphore releases the jobs in random order and we might have a long time where only the last, longest job is running. I would rather have the longest jobs start first, followed by the shortest ones, as this will give better parallelism use and predictability. Will have to take a look into that.
   
   👍🏻 





[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-788225228


   https://github.com/apache/airflow/pull/14531/files#diff-e864fe639f477ba3ff9ff3edf8dbc0d71af074ddcd7d8d4870b5fa41f1950454R19





[GitHub] [airflow] github-actions[bot] commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-794819110


   [The Workflow run](https://github.com/apache/airflow/actions/runs/637961774) is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.





[GitHub] [airflow] github-actions[bot] commented on pull request #14531: Running tests in parallel

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-798939967


   [The Workflow run](https://github.com/apache/airflow/actions/runs/651685483) is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.





[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-796598358


   Later on (the next few weekends, I guess) I will take a stab at https://github.com/apache/airflow/issues/14656 -> I think introducing two types of self-hosted runners will help us much further in driving the costs significantly down, and we will be able to upgrade the self-hosted runners to 8 CPU/64 GB and get the full benefit of 15-minute test execution (or we can decide to do it even now). But it hurts seeing an 8 CPU/64 GB machine used to build an image or wait for images being built.





[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-788222127


   We can always switch back. I implemented it in such a way that 'parallel' mode is enabled for self-hosted runners, but it can be switched off at any time by commenting out one line - see the PR.





[GitHub] [airflow] potiuk edited a comment on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-789392406


   > Have you double-checked the build time with this change on one of the proposed AWS instances? I'm nervous of blowing through our AWS credits in 1 month if this change _doesn't_ help.
   
   Yes. I already explained it when I created the PR (just scroll the PR a few pages up): https://github.com/apache/airflow/pull/14531#issuecomment-787506971 - but I can explain it again and again, no problem. I am a super patient person, though sometimes I am quite surprised that I have to answer questions that have already been answered (and ones that you even commented on).
   
   This is not the first time over the last few weeks that I have shared the data with you. The aim of this change (as I have also already explained several times) is to drive down the cost as well as the feedback time. This was my goal from the very beginning when I proposed to use big-memory machines and tmpfs (again, that was from my experience doing very similar optimizations in the past).
   
   And I even did some initial estimations, which I shared with you in this document when I actually ran those tests manually: https://docs.google.com/document/d/1ZZeZ4BYMNX7ycGRUKAXv0s6etz1g-90Onn5nRQQHOfE/edit#heading=h.mkysbyj2zg1t - you even commented on that document. So answering your short question in long form - yes I did, and you had every chance to see all the results before.
   
   With the current 32 GB instances, the whole suite of tests runs in 26 minutes instead of > 1 hour with this solution. It fails and runs slower than it could because it swaps out memory (it needs around 35 GB of memory when I tested it locally), so it will be much faster when we have more memory. My rough estimation is that it will run for ~15 minutes. But for sure it will not run for > 30 minutes, which means that we save at least 6 build hours with every PR.
   
   
   





[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-795322986


   > Before I merge this one I added one more change. This should allow dynamically choosing (using gnu parallel --semaphore) how many tests we run in parallel, depending on the number of CPUs (overridable with `MAX_PARALLEL_TEST_JOBS`). This should allow us to get ~20 minutes even with smaller machines, I believe. Also it is very likely (I am testing it in https://github.com/potiuk/airflow/actions/runs/637728789) that even with the default 2 CPUs of the public GitHub Runners we should be able to achieve a ~2x speedup. That would be awesome.
   
   Just an update on this one:
   
   * indeed, the parallel semaphore to limit the number of parallel test types works pretty nicely (see the sketch after this list). With the 4 CPU/32 GB machines we get ~30-minute tests, which is pretty much what I would expect.
   * The nice thing is that the tests automatically adjust the parallelism level depending on the number of CPUs/cores available, which is also great for developers running the tests on their development machines.
   * I almost got it running with ~45-minute tests in the GitHub Public runners. There are some hangs which I have to look at - likely due to memory usage. But maybe we can limit that further.
   * there are still a few tests failing (but there is a different reason for that - I will talk to @kaxil about it).
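   
   A minimal sketch of the semaphore approach mentioned above (the helper script and the test type names are illustrative, not the actual scripts):
   
   ```bash
   # Cap the number of test types running concurrently; default to CPU count
   MAX_PARALLEL_TEST_JOBS="${MAX_PARALLEL_TEST_JOBS:=$(nproc)}"
   
   for TEST_TYPE in Core Providers API CLI WWW Integration; do
       # Each invocation queues behind the named semaphore until a slot frees up
       parallel --semaphore --semaphorename airflow-tests \
           --jobs "${MAX_PARALLEL_TEST_JOBS}" \
           ./scripts/ci/run_single_test_type.sh "${TEST_TYPE}"
   done
   
   # Block until all semaphore-controlled jobs have finished
   parallel --semaphore --semaphorename airflow-tests --wait
   ```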





[GitHub] [airflow] ashb commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-789302736


   > Hey @ashb -> would you update the machines to the bigger ones? Or can I do it myself? How do we proceed?
   
   Have you double-checked the build time with this change on one of the proposed AWS instances? I'm nervous of blowing through our AWS credits in 1 month if this change _doesn't_ help.
   
   If yes, then the Launch Template in eu-central-1 defines the instance size, and the user-data defines how big we mount /var/lib/docker -- cloud-init.yaml lives in the airflow-ci-infra repo and (for now) needs to be manually updated in AWS on change. The command is in the commit that added it.





[GitHub] [airflow] potiuk edited a comment on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-798906062


   OK, looks like I got it under control. It will fail occasionally with 137, but in a similar way as in the previous setup.
   
   Summary: It seems that when we are done, the whole test suite will be able to complete in ~13 minutes rather than the current ~53 minutes, and when you have a powerful local machine, you will be able to run the whole test suite in less than 6 minutes.
   
   
   What I have done now is add a bit more `smarts`:
   
   a) I check how much memory and how many CPUs are available not on the host but in the docker engine. Those might differ - especially on Mac.
   
   b) I run most tests in parallel (as many parallel test runs as we have CPUs), but if we have less than ~32 GB of RAM available in the Docker Engine, I do not run Integration tests in parallel with the other tests - they are run separately, at the end, after cleaning up the docker engine remnants.
   
   c) If we have > 32 GB of memory in the docker engine, we run everything in parallel.
   
   The current numbers (they have already improved after the fantastic @jhtimmins optimisations in the WWW/API tests):
   
   ## Public GitHub runners:
   
   * Before parallelization: ~53 minutes
   * After parallelization (2 parallel test streams + Integration tests sequentially): ~34 minutes
   
   **Overall**: We save 20 minutes per test suite - a ~40% (!) speedup (for the whole test suite)
   
   ## Self-hosted runners with 4 cores/32 GB RAM + tmpfs (current setup)
   
   * Before parallelization: ~42 minutes
   * After parallelization (4 parallel test streams + Integration tests sequentially): ~24 minutes
   
   **Overall**: We save 18 minutes per test suite - a ~43% (!) speedup (for the whole test suite)
   
   ## Future self-hosted runners with 8 cores/64 GB RAM + tmpfs
   
   These are estimates based on earlier measurements on AWS, where I temporarily switched our templates to use them.
   
   * Before parallelization: ~40 minutes
   * After parallelization (8 parallel test streams): ~13 minutes
   
   **Overall**: We save 12 minutes vs the 4-core/32 GB machines, which is an ADDITIONAL almost ~50% (!) improvement (for the whole test suite)
   
   I think this is a really nice auto-detecting setup which will not only help with CI but will also allow developers who have powerful machines to run the full test suite locally. The full test suite on my machine - 8 CPUs (16 cores) + 64 GB RAM + no tmpfs (but a very fast SSD disk) - takes exactly 5 (!!!!) minutes.
   
   So we can still improve it. 
   





[GitHub] [airflow] potiuk edited a comment on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-787506971


   Hey @ashb - we need bigger machines, as I suspected :).
   
   The good news is that it will be much cheaper in the long run as we will need them for far less time.
   
   The tests are failing, but mainly because of memory problems and timeouts (so I guess we are simply using too much RAM); if we up the machines to 64 GB I think this should go rather smoothly. The good news is that even with not enough memory (and with failures/timeouts) the tests took ~26 min for `sqlite` - rather than > 1 h - so when we have enough memory we can achieve the 15 minutes I was hoping for. The 64 GB machines are only a bit more expensive than the 32 GB ones, so we will save a lot of credits when it works.
   
   We can even optimize it a bit further and have two self-hosted runner types:
   
   1) Big 64 GB ones for the tests
   2) Smaller 32 GB ones  for everything else
   
   It should be rather easy to configure in the `CI.yml`, but I am not sure whether the auto-scaling solution we have will handle two types.
   
   Here is a job where we have partial successes/failures, and it shows how those tests will look. This is actually a good one to show what the tests will look like. You can see that the output is nicely grouped, and you can see the monitoring and progress very clearly (it will be much nicer when we have more memory because each test will progress much faster). Also, I print a summary of the failed tests at the end - only "failure" outputs are fully printed to the logs at the end, in "Red" groups - which will make it far easier to analyze problems (the same kind of output improvement is in the sequential version of the tests run on GitHub runners).
   
   In this case three test types succeeded (`Heisentests Core Providers`) and the remaining 5 had some failures (most of them, from what I see, due to timeouts, which is perfectly understandable if we ran out of memory and started to swap out to remote SSD in the cloud): https://github.com/apache/airflow/pull/14531/checks?check_run_id=1999175947.
   
   ## Rationale for bigger machines
   
   From what I see we have machines with 32 GB, and since half of that will easily be eaten by `tmpfs` when we start writing logs and the like, we only have ~16 GB, which is not enough. During my tests (https://twitter.com/higrys/status/1366037359461101569/photo/1) all the tests running in parallel took ~35 GB of memory on my 64 GB machine. I had just a local SSD, not `tmpfs`, for those tests, but I do not think we need 30 GB of tmpfs for all logs, docker, tmp etc. (and we can fine-tune that if we do).
   
   Also, it is more important than before to clean up the `tmpfs` volumes before each run and make them "pristine" for every run - because we will be using nearly all of the space. I think that will also help with cases like #14505 where some left-overs from previous runs are causing the jobs to fail.
   
   This is the machine state before the tests are run:
   
   ```
                   total        used        free      shared  buff/cache   available
     Mem:           30Gi       696Mi        24Gi       3.4Gi       5.5Gi        26Gi
     Swap:            0B          0B          0B
   
     Filesystem      Size  Used Avail Use% Mounted on
     /dev/root       7.7G  2.6G  5.2G  34% /
     devtmpfs         16G     0   16G   0% /dev
     tmpfs            16G     0   16G   0% /dev/shm
     tmpfs           3.1G  804K  3.1G   1% /run
     tmpfs           5.0M     0  5.0M   0% /run/lock
     tmpfs            16G     0   16G   0% /sys/fs/cgroup
     tmpfs           3.1G  168K  3.1G   1% /tmp
     tmpfs            21G  2.9G   18G  15% /var/lib/docker
     tmpfs            16G  534M   15G   4% /home/runner/actions-runner/_work
     /dev/loop0       98M   98M     0 100% /snap/core/10185
     /dev/loop1       56M   56M     0 100% /snap/core18/1885
     /dev/loop2       71M   71M     0 100% /snap/lxd/16922
     /dev/loop3       29M   29M     0 100% /snap/amazon-ssm-agent/2012
   
   ```





[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-796595606


   Hey @ashb - for now, I am forcing max parallel = 1 for the GitHub Public Runners, and I think we should merge it once the tests pass. I will continue improving the "predictability" and memory usage for the Public Runners, so hopefully we can get it improved there as well.
   
   The Public Github Runner version of this runs here: https://github.com/potiuk/airflow/actions/runs/642275928
   
   





[GitHub] [airflow] potiuk edited a comment on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-789392406


   > Have you double-checked the build time with this change on one of the proposed AWS instances? I'm nervous of blowing through our AWS credits in 1 month if this change _doesn't_ help.
   
   Yes. I already explained it when I created the PR (just scroll the PR a few pages up): https://github.com/apache/airflow/pull/14531#issuecomment-787506971 - but I can explain it again and again, no problem. I am a super patient person, though sometimes I am quite surprised that I have to answer questions that have already been answered (and ones that you even commented on).
   
   This is not the first time over the last few weeks that I have shared the data with you. The aim of this change (as I have also already explained several times) is to drive down the cost. This was my goal from the very beginning, and I even did some initial estimations, which I shared with you in this document when I actually ran those tests manually: https://docs.google.com/document/d/1ZZeZ4BYMNX7ycGRUKAXv0s6etz1g-90Onn5nRQQHOfE/edit#heading=h.mkysbyj2zg1t - you even commented on that document. So answering your short question in long form - yes I did, and you had every chance to see all the results before.
   
   With the current 32 GB instances, the whole suite of tests runs in 26 minutes instead of > 1 hour with this solution. It fails and runs slower than it could because it swaps out memory (it needs around 35 GB of memory when I tested it locally), so it will be much faster when we have more memory. My rough estimation is that it will run for ~15 minutes. But for sure it will not run for > 30 minutes, which means that we save at least 6 build hours with every PR.
   
   
   





[GitHub] [airflow] ashb edited a comment on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
ashb edited a comment on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-796635976


   > Later on (the next few weekends, I guess) I will take a stab at #14656 -> I think introducing two types of self-hosted runners will help us much further in driving the costs significantly down, and we will be able to upgrade the self-hosted runners to 8 CPU/64 GB and get the full benefit of 15-minute test execution (or we can decide to do it even now). But it hurts seeing an 8 CPU/64 GB machine used to build an image or wait for images being built.
   
   Yeah I agree.
   
   <ins>Moved to comment in other issue.</ins>





[GitHub] [airflow] ashb commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-796635976


   > Later on (the next few weekends, I guess) I will take a stab at #14656 -> I think introducing two types of self-hosted runners will help us much further in driving the costs significantly down, and we will be able to upgrade the self-hosted runners to 8 CPU/64 GB and get the full benefit of 15-minute test execution (or we can decide to do it even now). But it hurts seeing an 8 CPU/64 GB machine used to build an image or wait for images being built.
   
   Yeah I agree.
   
   Rough steps we will need to do to have two "pools" of runners:
   
   1. Get ASF infra to adjust the labels on the existing runners we have configured. Currently they just have the `self-hosted` label -- we should ask them to set all 100 to additionally have a `large` label or something
   
   1. Update https://github.com/apache/airflow-ci-infra/blob/master/scripts/runner-supervisor.py to pull credentials from a different path in AWS SSM.
   
       Right now it looks for creds in `/runners/apache/airflow` -- i.e. `/runners/apache/airflow/1`...`/runners/apache/airflow/100`
   
       We would probably have to change that to look at `/runners/apache/airflow/{size}/{idx}` (see the CLI sketch below)
   
   1. Open a ticket with Infra, and ask them to create a new runner registration token. Use that token and the https://github.com/apache/airflow-ci-infra/blob/master/scripts/store-agent-creds.py script to store those creds in SSM. Script will need updating to change/add labels and store in a different location too.
   
   1. Create a second ASG with smaller instances, also configuring/changing the userdata so the instance starts in the right "pool".
   
   1. The Lambda function will need updating to know which ASG to scale based on the job name. I _think_ we might have to just duplicate the job ids and sizes into the runner, as we don't get much info in the webhook payload (I've included a sample one below in the details block)
   
       There may be a better approach here.
   
   
   I'm more than happy to help or do all of the above myself if you'd like, just laying out the steps that come to mind.
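   
   For illustration, once the credentials live under the per-size layout from step 2, they could be inspected with something like this (the pool name `large` and the exact path are assumptions based on the layout above):
   
   ```bash
   # List runner credential parameters registered under the "large" pool
   aws --profile airflow ssm get-parameters-by-path \
       --path "/runners/apache/airflow/large" \
       --recursive --with-decryption \
       --query "Parameters[].Name"
   ```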
   
   
   <details>
   <summary>Webhook payload</summary>
   <pre><code>{
     "action": "created",
     "check_run": {
       "id": 2085234698,
       "node_id": "MDg6Q2hlY2tSdW4yMDg1MjM0Njk4",
       "head_sha": "9070ba280253e918c2582f52c2b77a07c311198f",
       "external_id": "ff9e9115-6980-59bb-9ae7-308d64bdf685",
       "url": "https://api.github.com/repos/apache/airflow/check-runs/2085234698",
       "html_url": "https://github.com/apache/airflow/runs/2085234698",
       "details_url": "https://github.com/apache/airflow/runs/2085234698",
       "status": "queued",
       "conclusion": null,
       "started_at": "2021-03-11T10:09:49Z",
       "completed_at": null,
       "output": {
         "title": null,
         "summary": null,
         "text": null,
         "annotations_count": 0,
         "annotations_url": "https://api.github.com/repos/apache/airflow/check-runs/2085234698/annotations"
       },
       "name": "K8s 3.7 v1.16.9 image",
       "check_suite": {
         "id": 2233450802,
         "node_id": "MDEwOkNoZWNrU3VpdGUyMjMzNDUwODAy",
         "head_branch": "master",
         "head_sha": "9070ba280253e918c2582f52c2b77a07c311198f",
         "status": "queued",
         "conclusion": null,
         "url": "https://api.github.com/repos/apache/airflow/check-suites/2233450802",
         "before": "943baff6701f9f8591090bf76219571d7f5e2cc5",
         "after": "9070ba280253e918c2582f52c2b77a07c311198f",
         "pull_requests": [
           {
             "url": "https://api.github.com/repos/rpatil524/airflow/pulls/80",
             "id": 581521861,
             "number": 80,
             "head": {
               "ref": "master",
               "sha": "9070ba280253e918c2582f52c2b77a07c311198f",
               "repo": {
                 "id": 33884891,
                 "url": "https://api.github.com/repos/apache/airflow",
                 "name": "airflow"
               }
             },
             "base": {
               "ref": "master",
               "sha": "fff344443f764a0a5c9f5ba7dfbfa1e5e1115348",
               "repo": {
                 "id": 336576614,
                 "url": "https://api.github.com/repos/rpatil524/airflow",
                 "name": "airflow"
               }
             }
           },
           {
             "url": "https://api.github.com/repos/yufeixin/airflow/pulls/305",
             "id": 581371796,
             "number": 305,
             "head": {
               "ref": "master",
               "sha": "9070ba280253e918c2582f52c2b77a07c311198f",
               "repo": {
                 "id": 33884891,
                 "url": "https://api.github.com/repos/apache/airflow",
                 "name": "airflow"
               }
             },
             "base": {
               "ref": "master",
               "sha": "f9cc775adc0dff49b0e288d8f2745bf097017321",
               "repo": {
                 "id": 320438482,
                 "url": "https://api.github.com/repos/yufeixin/airflow",
                 "name": "airflow"
               }
             }
           },
           {
             "url": "https://api.github.com/repos/ddelange/airflow/pulls/6",
             "id": 571967064,
             "number": 6,
             "head": {
               "ref": "master",
               "sha": "9070ba280253e918c2582f52c2b77a07c311198f",
               "repo": {
                 "id": 33884891,
                 "url": "https://api.github.com/repos/apache/airflow",
                 "name": "airflow"
               }
             },
             "base": {
               "ref": "master",
               "sha": "f9c9e9c38f444a39987478f3d1a262db909de8c4",
               "repo": {
                 "id": 219798743,
                 "url": "https://api.github.com/repos/ddelange/airflow",
                 "name": "airflow"
               }
             }
           },
           {
             "url": "https://api.github.com/repos/weiplanet/airflow/pulls/1",
             "id": 562187084,
             "number": 1,
             "head": {
               "ref": "master",
               "sha": "9070ba280253e918c2582f52c2b77a07c311198f",
               "repo": {
                 "id": 33884891,
                 "url": "https://api.github.com/repos/apache/airflow",
                 "name": "airflow"
               }
             },
             "base": {
               "ref": "master",
               "sha": "2345cd1f0337f4f06a2f2595a1a59c30dc4a1535",
               "repo": {
                 "id": 303579760,
                 "url": "https://api.github.com/repos/weiplanet/airflow",
                 "name": "airflow"
               }
             }
           },
           {
             "url": "https://api.github.com/repos/TestingCodeReview/airflow/pulls/1",
             "id": 503500807,
             "number": 1,
             "head": {
               "ref": "master",
               "sha": "9070ba280253e918c2582f52c2b77a07c311198f",
               "repo": {
                 "id": 33884891,
                 "url": "https://api.github.com/repos/apache/airflow",
                 "name": "airflow"
               }
             },
             "base": {
               "ref": "master",
               "sha": "0f02e45b7e127da3539752607ab02347d9fcd733",
               "repo": {
                 "id": 169064440,
                 "url": "https://api.github.com/repos/TestingCodeReview/airflow",
                 "name": "airflow"
               }
             }
           },
           {
             "url": "https://api.github.com/repos/mehrdad-shokri/incubator-airflow/pulls/2",
             "id": 470767538,
             "number": 2,
             "head": {
               "ref": "master",
               "sha": "9070ba280253e918c2582f52c2b77a07c311198f",
               "repo": {
                 "id": 33884891,
                 "url": "https://api.github.com/repos/apache/airflow",
                 "name": "airflow"
               }
             },
             "base": {
               "ref": "master",
               "sha": "5739ba28a7d8634ba3702d5fa39e5ef663f5bf7c",
               "repo": {
                 "id": 110538040,
                 "url": "https://api.github.com/repos/mehrdad-shokri/incubator-airflow",
                 "name": "incubator-airflow"
               }
             }
           }
         ],
         "app": {
           "id": 15368,
           "slug": "github-actions",
           "node_id": "MDM6QXBwMTUzNjg=",
           "owner": {
             "login": "github",
             "id": 9919,
             "node_id": "MDEyOk9yZ2FuaXphdGlvbjk5MTk=",
             "avatar_url": "https://avatars.githubusercontent.com/u/9919?v=4",
             "gravatar_id": "",
             "url": "https://api.github.com/users/github",
             "html_url": "https://github.com/github",
             "followers_url": "https://api.github.com/users/github/followers",
             "following_url": "https://api.github.com/users/github/following{/other_user}",
             "gists_url": "https://api.github.com/users/github/gists{/gist_id}",
             "starred_url": "https://api.github.com/users/github/starred{/owner}{/repo}",
             "subscriptions_url": "https://api.github.com/users/github/subscriptions",
             "organizations_url": "https://api.github.com/users/github/orgs",
             "repos_url": "https://api.github.com/users/github/repos",
             "events_url": "https://api.github.com/users/github/events{/privacy}",
             "received_events_url": "https://api.github.com/users/github/received_events",
             "type": "Organization",
             "site_admin": false
           },
           "name": "GitHub Actions",
           "description": "Automate your workflow from idea to production",
           "external_url": "https://help.github.com/en/actions",
           "html_url": "https://github.com/apps/github-actions",
           "created_at": "2018-07-30T09:30:17Z",
           "updated_at": "2019-12-10T19:04:12Z",
           "permissions": {
             "actions": "write",
             "checks": "write",
             "contents": "write",
             "deployments": "write",
             "issues": "write",
             "metadata": "read",
             "organization_packages": "write",
             "packages": "write",
             "pages": "write",
             "pull_requests": "write",
             "repository_hooks": "write",
             "repository_projects": "write",
             "security_events": "write",
             "statuses": "write",
             "vulnerability_alerts": "read"
           },
           "events": [
             "check_run",
             "check_suite",
             "create",
             "delete",
             "deployment",
             "deployment_status",
             "fork",
             "gollum",
             "issues",
             "issue_comment",
             "label",
             "milestone",
             "page_build",
             "project",
             "project_card",
             "project_column",
             "public",
             "pull_request",
             "pull_request_review",
             "pull_request_review_comment",
             "push",
             "registry_package",
             "release",
             "repository",
             "repository_dispatch",
             "status",
             "watch",
             "workflow_dispatch",
             "workflow_run"
           ]
         },
         "created_at": "2021-03-11T09:44:47Z",
         "updated_at": "2021-03-11T10:09:45Z"
       },
       "app": {
         "id": 15368,
         "slug": "github-actions",
         "node_id": "MDM6QXBwMTUzNjg=",
         "owner": {
           "login": "github",
           "id": 9919,
           "node_id": "MDEyOk9yZ2FuaXphdGlvbjk5MTk=",
           "avatar_url": "https://avatars.githubusercontent.com/u/9919?v=4",
           "gravatar_id": "",
           "url": "https://api.github.com/users/github",
           "html_url": "https://github.com/github",
           "followers_url": "https://api.github.com/users/github/followers",
           "following_url": "https://api.github.com/users/github/following{/other_user}",
           "gists_url": "https://api.github.com/users/github/gists{/gist_id}",
           "starred_url": "https://api.github.com/users/github/starred{/owner}{/repo}",
           "subscriptions_url": "https://api.github.com/users/github/subscriptions",
           "organizations_url": "https://api.github.com/users/github/orgs",
           "repos_url": "https://api.github.com/users/github/repos",
           "events_url": "https://api.github.com/users/github/events{/privacy}",
           "received_events_url": "https://api.github.com/users/github/received_events",
           "type": "Organization",
           "site_admin": false
         },
         "name": "GitHub Actions",
         "description": "Automate your workflow from idea to production",
         "external_url": "https://help.github.com/en/actions",
         "html_url": "https://github.com/apps/github-actions",
         "created_at": "2018-07-30T09:30:17Z",
         "updated_at": "2019-12-10T19:04:12Z",
         "permissions": {
           "actions": "write",
           "checks": "write",
           "contents": "write",
           "deployments": "write",
           "issues": "write",
           "metadata": "read",
           "organization_packages": "write",
           "packages": "write",
           "pages": "write",
           "pull_requests": "write",
           "repository_hooks": "write",
           "repository_projects": "write",
           "security_events": "write",
           "statuses": "write",
           "vulnerability_alerts": "read"
         },
         "events": [
           "check_run",
           "check_suite",
           "create",
           "delete",
           "deployment",
           "deployment_status",
           "fork",
           "gollum",
           "issues",
           "issue_comment",
           "label",
           "milestone",
           "page_build",
           "project",
           "project_card",
           "project_column",
           "public",
           "pull_request",
           "pull_request_review",
           "pull_request_review_comment",
           "push",
           "registry_package",
           "release",
           "repository",
           "repository_dispatch",
           "status",
           "watch",
           "workflow_dispatch",
           "workflow_run"
         ]
       },
       "pull_requests": [
         {
           "url": "https://api.github.com/repos/rpatil524/airflow/pulls/80",
           "id": 581521861,
           "number": 80,
           "head": {
             "ref": "master",
             "sha": "9070ba280253e918c2582f52c2b77a07c311198f",
             "repo": {
               "id": 33884891,
               "url": "https://api.github.com/repos/apache/airflow",
               "name": "airflow"
             }
           },
           "base": {
             "ref": "master",
             "sha": "fff344443f764a0a5c9f5ba7dfbfa1e5e1115348",
             "repo": {
               "id": 336576614,
               "url": "https://api.github.com/repos/rpatil524/airflow",
               "name": "airflow"
             }
           }
         },
         {
           "url": "https://api.github.com/repos/yufeixin/airflow/pulls/305",
           "id": 581371796,
           "number": 305,
           "head": {
             "ref": "master",
             "sha": "9070ba280253e918c2582f52c2b77a07c311198f",
             "repo": {
               "id": 33884891,
               "url": "https://api.github.com/repos/apache/airflow",
               "name": "airflow"
             }
           },
           "base": {
             "ref": "master",
             "sha": "f9cc775adc0dff49b0e288d8f2745bf097017321",
             "repo": {
               "id": 320438482,
               "url": "https://api.github.com/repos/yufeixin/airflow",
               "name": "airflow"
             }
           }
         },
         {
           "url": "https://api.github.com/repos/ddelange/airflow/pulls/6",
           "id": 571967064,
           "number": 6,
           "head": {
             "ref": "master",
             "sha": "9070ba280253e918c2582f52c2b77a07c311198f",
             "repo": {
               "id": 33884891,
               "url": "https://api.github.com/repos/apache/airflow",
               "name": "airflow"
             }
           },
           "base": {
             "ref": "master",
             "sha": "f9c9e9c38f444a39987478f3d1a262db909de8c4",
             "repo": {
               "id": 219798743,
               "url": "https://api.github.com/repos/ddelange/airflow",
               "name": "airflow"
             }
           }
         },
         {
           "url": "https://api.github.com/repos/weiplanet/airflow/pulls/1",
           "id": 562187084,
           "number": 1,
           "head": {
             "ref": "master",
             "sha": "9070ba280253e918c2582f52c2b77a07c311198f",
             "repo": {
               "id": 33884891,
               "url": "https://api.github.com/repos/apache/airflow",
               "name": "airflow"
             }
           },
           "base": {
             "ref": "master",
             "sha": "2345cd1f0337f4f06a2f2595a1a59c30dc4a1535",
             "repo": {
               "id": 303579760,
               "url": "https://api.github.com/repos/weiplanet/airflow",
               "name": "airflow"
             }
           }
         },
         {
           "url": "https://api.github.com/repos/TestingCodeReview/airflow/pulls/1",
           "id": 503500807,
           "number": 1,
           "head": {
             "ref": "master",
             "sha": "9070ba280253e918c2582f52c2b77a07c311198f",
             "repo": {
               "id": 33884891,
               "url": "https://api.github.com/repos/apache/airflow",
               "name": "airflow"
             }
           },
           "base": {
             "ref": "master",
             "sha": "0f02e45b7e127da3539752607ab02347d9fcd733",
             "repo": {
               "id": 169064440,
               "url": "https://api.github.com/repos/TestingCodeReview/airflow",
               "name": "airflow"
             }
           }
         },
         {
           "url": "https://api.github.com/repos/mehrdad-shokri/incubator-airflow/pulls/2",
           "id": 470767538,
           "number": 2,
           "head": {
             "ref": "master",
             "sha": "9070ba280253e918c2582f52c2b77a07c311198f",
             "repo": {
               "id": 33884891,
               "url": "https://api.github.com/repos/apache/airflow",
               "name": "airflow"
             }
           },
           "base": {
             "ref": "master",
             "sha": "5739ba28a7d8634ba3702d5fa39e5ef663f5bf7c",
             "repo": {
               "id": 110538040,
               "url": "https://api.github.com/repos/mehrdad-shokri/incubator-airflow",
               "name": "incubator-airflow"
             }
           }
         }
       ]
     },
     "repository": {
       "id": 33884891,
       "node_id": "MDEwOlJlcG9zaXRvcnkzMzg4NDg5MQ==",
       "name": "airflow",
       "full_name": "apache/airflow",
       "private": false,
       "owner": {
         "login": "apache",
         "id": 47359,
         "node_id": "MDEyOk9yZ2FuaXphdGlvbjQ3MzU5",
         "avatar_url": "https://avatars.githubusercontent.com/u/47359?v=4",
         "gravatar_id": "",
         "url": "https://api.github.com/users/apache",
         "html_url": "https://github.com/apache",
         "followers_url": "https://api.github.com/users/apache/followers",
         "following_url": "https://api.github.com/users/apache/following{/other_user}",
         "gists_url": "https://api.github.com/users/apache/gists{/gist_id}",
         "starred_url": "https://api.github.com/users/apache/starred{/owner}{/repo}",
         "subscriptions_url": "https://api.github.com/users/apache/subscriptions",
         "organizations_url": "https://api.github.com/users/apache/orgs",
         "repos_url": "https://api.github.com/users/apache/repos",
         "events_url": "https://api.github.com/users/apache/events{/privacy}",
         "received_events_url": "https://api.github.com/users/apache/received_events",
         "type": "Organization",
         "site_admin": false
       },
       "html_url": "https://github.com/apache/airflow",
       "description": "Apache Airflow - A platform to programmatically author, schedule, and monitor workflows",
       "fork": false,
       "url": "https://api.github.com/repos/apache/airflow",
       "forks_url": "https://api.github.com/repos/apache/airflow/forks",
       "keys_url": "https://api.github.com/repos/apache/airflow/keys{/key_id}",
       "collaborators_url": "https://api.github.com/repos/apache/airflow/collaborators{/collaborator}",
       "teams_url": "https://api.github.com/repos/apache/airflow/teams",
       "hooks_url": "https://api.github.com/repos/apache/airflow/hooks",
       "issue_events_url": "https://api.github.com/repos/apache/airflow/issues/events{/number}",
       "events_url": "https://api.github.com/repos/apache/airflow/events",
       "assignees_url": "https://api.github.com/repos/apache/airflow/assignees{/user}",
       "branches_url": "https://api.github.com/repos/apache/airflow/branches{/branch}",
       "tags_url": "https://api.github.com/repos/apache/airflow/tags",
       "blobs_url": "https://api.github.com/repos/apache/airflow/git/blobs{/sha}",
       "git_tags_url": "https://api.github.com/repos/apache/airflow/git/tags{/sha}",
       "git_refs_url": "https://api.github.com/repos/apache/airflow/git/refs{/sha}",
       "trees_url": "https://api.github.com/repos/apache/airflow/git/trees{/sha}",
       "statuses_url": "https://api.github.com/repos/apache/airflow/statuses/{sha}",
       "languages_url": "https://api.github.com/repos/apache/airflow/languages",
       "stargazers_url": "https://api.github.com/repos/apache/airflow/stargazers",
       "contributors_url": "https://api.github.com/repos/apache/airflow/contributors",
       "subscribers_url": "https://api.github.com/repos/apache/airflow/subscribers",
       "subscription_url": "https://api.github.com/repos/apache/airflow/subscription",
       "commits_url": "https://api.github.com/repos/apache/airflow/commits{/sha}",
       "git_commits_url": "https://api.github.com/repos/apache/airflow/git/commits{/sha}",
       "comments_url": "https://api.github.com/repos/apache/airflow/comments{/number}",
       "issue_comment_url": "https://api.github.com/repos/apache/airflow/issues/comments{/number}",
       "contents_url": "https://api.github.com/repos/apache/airflow/contents/{+path}",
       "compare_url": "https://api.github.com/repos/apache/airflow/compare/{base}...{head}",
       "merges_url": "https://api.github.com/repos/apache/airflow/merges",
       "archive_url": "https://api.github.com/repos/apache/airflow/{archive_format}{/ref}",
       "downloads_url": "https://api.github.com/repos/apache/airflow/downloads",
       "issues_url": "https://api.github.com/repos/apache/airflow/issues{/number}",
       "pulls_url": "https://api.github.com/repos/apache/airflow/pulls{/number}",
       "milestones_url": "https://api.github.com/repos/apache/airflow/milestones{/number}",
       "notifications_url": "https://api.github.com/repos/apache/airflow/notifications{?since,all,participating}",
       "labels_url": "https://api.github.com/repos/apache/airflow/labels{/name}",
       "releases_url": "https://api.github.com/repos/apache/airflow/releases{/id}",
       "deployments_url": "https://api.github.com/repos/apache/airflow/deployments",
       "created_at": "2015-04-13T18:04:58Z",
       "updated_at": "2021-03-11T09:44:52Z",
       "pushed_at": "2021-03-11T09:58:35Z",
       "git_url": "git://github.com/apache/airflow.git",
       "ssh_url": "git@github.com:apache/airflow.git",
       "clone_url": "https://github.com/apache/airflow.git",
       "svn_url": "https://github.com/apache/airflow",
       "homepage": "https://airflow.apache.org/",
       "size": 111044,
       "stargazers_count": 20723,
       "watchers_count": 20723,
       "language": "Python",
       "has_issues": true,
       "has_projects": true,
       "has_downloads": true,
       "has_wiki": false,
       "has_pages": false,
       "forks_count": 8134,
       "mirror_url": null,
       "archived": false,
       "disabled": false,
       "open_issues_count": 987,
       "license": {
         "key": "apache-2.0",
         "name": "Apache License 2.0",
         "spdx_id": "Apache-2.0",
         "url": "https://api.github.com/licenses/apache-2.0",
         "node_id": "MDc6TGljZW5zZTI="
       },
       "forks": 8134,
       "open_issues": 987,
       "watchers": 20723,
       "default_branch": "master"
     },
     "organization": {
       "login": "apache",
       "id": 47359,
       "node_id": "MDEyOk9yZ2FuaXphdGlvbjQ3MzU5",
       "url": "https://api.github.com/orgs/apache",
       "repos_url": "https://api.github.com/orgs/apache/repos",
       "events_url": "https://api.github.com/orgs/apache/events",
       "hooks_url": "https://api.github.com/orgs/apache/hooks",
       "issues_url": "https://api.github.com/orgs/apache/issues",
       "members_url": "https://api.github.com/orgs/apache/members{/member}",
       "public_members_url": "https://api.github.com/orgs/apache/public_members{/member}",
       "avatar_url": "https://avatars.githubusercontent.com/u/47359?v=4",
       "description": ""
     },
     "sender": {
       "login": "ashb",
       "id": 34150,
       "node_id": "MDQ6VXNlcjM0MTUw",
       "avatar_url": "https://avatars.githubusercontent.com/u/34150?v=4",
       "gravatar_id": "",
       "url": "https://api.github.com/users/ashb",
       "html_url": "https://github.com/ashb",
       "followers_url": "https://api.github.com/users/ashb/followers",
       "following_url": "https://api.github.com/users/ashb/following{/other_user}",
       "gists_url": "https://api.github.com/users/ashb/gists{/gist_id}",
       "starred_url": "https://api.github.com/users/ashb/starred{/owner}{/repo}",
       "subscriptions_url": "https://api.github.com/users/ashb/subscriptions",
       "organizations_url": "https://api.github.com/users/ashb/orgs",
       "repos_url": "https://api.github.com/users/ashb/repos",
       "events_url": "https://api.github.com/users/ashb/events{/privacy}",
       "received_events_url": "https://api.github.com/users/ashb/received_events",
       "type": "User",
       "site_admin": false
     },
     "installation": {
       "id": 14172953,
       "node_id": "MDIzOkludGVncmF0aW9uSW5zdGFsbGF0aW9uMTQxNzI5NTM="
     }
   }</code></pre>
   </details>
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-798807906


   [The Workflow run](https://github.com/apache/airflow/actions/runs/650160036) is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#discussion_r591415464



##########
File path: tests/www/test_views.py
##########
@@ -1154,6 +1154,11 @@ def test_page_instance_name_xss_prevention(self):
 
 
 class TestConfigurationView(TestBase):
+    def setUp(self):
+        super().setUp()
+        with mock.patch.dict(os.environ, {"AIRFLOW__CORE__UNIT_TEST_MODE": "False"}):
+            initialize_config()

Review comment:
       Sure. This test reads the content of the airflow configuration and returns it to the caller. The test failed (when I ran it in parallel) because the config was missing. The web API returns the "production" config - no matter whether UNIT_TEST_MODE is set to True or not. I have not yet had time to figure out why the "airflow.cfg" is missing when I run the test in parallel mode (it is there when we run the tests sequentially).
   
   I still have to figure it out (and planned to). My guess is that some factor influences the order in which the tests are executed, and some other test simply deletes the configuration as part of its execution, so the config never gets restored by the time `test_configuration_expose_config` is executed. Unfortunately the presence of the config is an 'implicit dependency' in many of our tests - the problem is that the config gets automatically created by the airflow configuration code when it is missing, so if any test depends on its presence it might go unnoticed that the config is actually a side effect of either other tests or simply of running the airflow CLI.
   
   So I believe my fix is the `proper` fix (but there are probably other tests that have a similar side-effect dependency). Since this test depends on `airflow.cfg` being present, this should not be a side effect - it should be explicitly initialised in the `setUp` of the test. Which is precisely what I do - I make sure that "unit_test_mode" is set to false, so the "production" and not the "unittest" config is initialized, and then I initialise it.
   
   I think this is how the test should have been written in the first place.
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#discussion_r591416217



##########
File path: tests/www/test_views.py
##########
@@ -1154,6 +1154,11 @@ def test_page_instance_name_xss_prevention(self):
 
 
 class TestConfigurationView(TestBase):
+    def setUp(self):
+        super().setUp()
+        with mock.patch.dict(os.environ, {"AIRFLOW__CORE__UNIT_TEST_MODE": "False"}):
+            initialize_config()

Review comment:
       I was thinking about separating this out into a separate PR, which I will likely do in the evening when I switch to my personal PC.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-788222127


   We can always switch back. I implemented it in such a way that the 'parallel' mode is enabled for self-hosted runners, but it can be switched off at any time by commenting out one line - see the PR.
   
   Regular PRs should work as before.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#discussion_r591351292



##########
File path: tests/www/test_views.py
##########
@@ -1154,6 +1154,11 @@ def test_page_instance_name_xss_prevention(self):
 
 
 class TestConfigurationView(TestBase):
+    def setUp(self):
+        super().setUp()
+        with mock.patch.dict(os.environ, {"AIRFLOW__CORE__UNIT_TEST_MODE": "False"}):
+            initialize_config()

Review comment:
       @potiuk Can you explain why you are setting unit test mode to false here? At first glance that seems incorrect.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-803374395


   AH NOOO, the last rebase was simply taking the latest fixup :facepalm:


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-803296372


   And I will also add `set -e` to fail on any error when the runner gets initialized.
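   
   Something along these lines at the top of the init (user-data) script - just a sketch; the actual script lives in the airflow-ci-infra repo and the surrounding steps here are placeholders:
   
   ```bash
   #!/usr/bin/env bash
   # Fail fast: exit on any error, on use of an unset variable, and on failures inside pipes
   set -euo pipefail
   
   echo "Initializing the self-hosted runner"
   # ... runner registration, docker credentials setup etc. would follow here ...
   ```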


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-794709877


   Before I merge this one I added one more change. It should allow us to dynamically choose (using GNU parallel `--semaphore`) how many tests we run in parallel, depending on the number of CPUs (overridable with `MAX_PARALLEL_TEST_JOBS`). This should allow us to get to ~20 minutes even with smaller machines, I believe. It is also very likely (I am testing it in https://github.com/potiuk/airflow/actions/runs/637728789) that even with the default 2 CPUs of the public GitHub runners we should be able to achieve a ~2x speedup. That would be awesome.
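   
   Roughly, the idea looks like this (a simplified sketch - `run_single_test_type` follows the naming used in this PR, and the real script has more bookkeeping around it):
   
   ```bash
   # Default to the number of CPUs, overridable with MAX_PARALLEL_TEST_JOBS
   MAX_PARALLEL_TEST_JOBS="${MAX_PARALLEL_TEST_JOBS:=$(nproc)}"
   
   # Export the function so it is visible in the (bash) shells started by GNU parallel
   export -f run_single_test_type
   
   for test_type in ${TEST_TYPES}
   do
       # "parallel --semaphore" (alias: sem) blocks until a job slot is free,
       # then starts the command in the background
       parallel --semaphore --jobs "${MAX_PARALLEL_TEST_JOBS}" --id airflow-tests \
           "run_single_test_type ${test_type}"
   done
   # Wait until all the jobs started under this semaphore have finished
   parallel --semaphore --wait --id airflow-tests
   ```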


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-787506971


   Hey @ashb - we need bigger machines as I suspected :) . 
   
   The good news is that it will be much cheaper in the long run as we will need them for far less time.
   
   The tests are failing, but mainly because of memory problems and timeouts (so I guess we are simply using too much RAM); if we up the machines to 64 GB I think this should go rather smoothly. The good news is that even with not enough memory (and with failures/timeouts) the tests took ~26 m for `sqlite` - rather than > 1 h - so when we have enough memory we can achieve the 15 minutes I was hoping for. Those 64 GB machines are only a bit more expensive than the 32 GB ones, so we will save a lot of credits when it works.
   
   We can even optimize it away a bit and have two self-hosted types:
   
   1) Big 64 GB ones for the tests
   2) Smaller 32 GB ones  for everything else
   
   It should be rather easy to configure in the `CI.yml`, but I am not sure if the auto-scaling solution we have will handle two types?
   
   Here is a job where we have partial successes/failures, and it shows how those tests will look - it is actually a good one to demonstrate it. You can see that the output is nicely grouped and you can see very clearly the monitoring and progress (it will be much nicer when we have more memory, because each test will progress much faster). Also, I print a summary of the failed tests at the end - only the "failure" outputs are fully printed to the logs at the end, in "Red" groups - which will make it far easier to analyze the problems (the same kind of output improvement is in the sequential version of the tests run on GitHub runners).
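   
   Roughly, the failed-output summary works along these lines (simplified sketch - `failed_test_types` is a placeholder here and the real script uses its own helpers, but the idea is the same):
   
   ```bash
   # After all parallel test types have finished, re-print the full output of the
   # failed ones only, each in its own (red) GitHub Actions log group
   for test_type in ${failed_test_types}
   do
       echo "::group::${COLOR_RED}Output of failed test type: ${test_type}${COLOR_RESET}"
       cat "${CACHE_TMP_FILE_DIR}/output-${test_type}.log"
       echo "::endgroup::"
   done
   ```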
   
   In this case three test types succeeded (`Heisentests Core Providers`) and the remaining 5 had some failures (most of them, from what I see, due to timeouts, which is perfectly understandable if we ran out of memory and started to swap out to remote SSD in the cloud): https://github.com/apache/airflow/pull/14531/checks?check_run_id=1999175947.
   
   Rationale for bigger machines:
   
   From what I see we have machines with 32 GB, and since half of that will easily be eaten by `tmpfs` when we start writing logs and the like, we only have ~16 GB, which is not enough. During my tests (https://twitter.com/higrys/status/1366037359461101569/photo/1) all the tests running in parallel took ~35 GB of memory on my 64 GB machine. I had just a local SSD, not `tmpfs`, for those tests, but I do not think we need 30 GB of tmpfs for all logs, docker, tmp etc. (and we can fine-tune that if we do).
   
   Also, it is more important than before to clean up the `tmpfs` volumes before each run and make them "pristine" for every run - because we will be using nearly all of the space. I think that will also help with cases like #14505 where some left-overs from previous runs are causing the jobs to fail.
   
   ```
                         total        used        free      shared  buff/cache   available
     Mem:           30Gi       696Mi        24Gi       3.4Gi       5.5Gi        26Gi
     Swap:            0B          0B          0B
   
     Filesystem      Size  Used Avail Use% Mounted on
     /dev/root       7.7G  2.6G  5.2G  34% /
     devtmpfs         16G     0   16G   0% /dev
     tmpfs            16G     0   16G   0% /dev/shm
     tmpfs           3.1G  804K  3.1G   1% /run
     tmpfs           5.0M     0  5.0M   0% /run/lock
     tmpfs            16G     0   16G   0% /sys/fs/cgroup
     tmpfs           3.1G  168K  3.1G   1% /tmp
     tmpfs            21G  2.9G   18G  15% /var/lib/docker
     tmpfs            16G  534M   15G   4% /home/runner/actions-runner/_work
     /dev/loop0       98M   98M     0 100% /snap/core/10185
     /dev/loop1       56M   56M     0 100% /snap/core18/1885
     /dev/loop2       71M   71M     0 100% /snap/lxd/16922
     /dev/loop3       29M   29M     0 100% /snap/amazon-ssm-agent/2012
   
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-792418265


   I upgraded the machines and ran the tests. There are a few flaky ones (one of them should be removed in the current build). But the results are exactly as I expected:
   
   https://github.com/apache/airflow/runs/2052982344?check_suite_focus=true
   
   * Full test suite: 1h 2m -> 15m - a 4x improvement (!)
   * Static checks: 20% improvement: 9m -> 6m (we have more CPUs)
   * Pylint checks: 12m -> 7m20s (a 35% improvement - also more CPUs)
   
   * Docs: unfortunately (as expected) no improvement - 27 minutes. Building docs is something that we have to parallelise, but when we do, we will get below 8 minutes.
   
   Once we merge it, building the docs will become the main bottleneck.
   
   I'd say - let's merge it as soon as possible - not only will we get a much faster response, but it will also be cheaper (for the longest jobs, repeated in the matrix builds, we need 2x more expensive machines for 4x less time).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-789392406


   > Have you double checked the build time with this change on one of the proposed AWS instances? I'm nervous of blowing throw our AWS credits in 1 month if this change _doesn't_ help.
   
   Yes. I already explained it when I created the PR (just scroll the PR a few pages up): https://github.com/apache/airflow/pull/14531#issuecomment-787506971 - but I can explain it again and again, no problem. I am a super patient person, though sometimes I am quite surprised that I have to answer questions that have already been answered (and ones that you even commented on).
   
   This is not the first time over the last few weeks that I have shared the data with you. The aim of this change (as I also already explained several times) is to actually drive the cost down. This was my goal from the very beginning, and I even did some initial estimations, which I shared with you in this document when I actually ran those tests manually: https://docs.google.com/document/d/1ZZeZ4BYMNX7ycGRUKAXv0s6etz1g-90Onn5nRQQHOfE/edit#heading=h.mkysbyj2zg1t - you even commented on that document. So answering your short question in long form - yes I did, and you had every chance to see all the results before.
   
   With the current 32 GB instances the whole suite of tests runs in 26 minutes instead of > 1 hour with this solution. It fails and runs slower than it could because it swaps out memory (it needed around 35 GB of memory when I tested it locally), so it will be much faster when we have more memory. My rough estimation is that it will run for ~15 minutes. But for sure it will not run for > 30 minutes, which means that we save at least 6 build hours with every PR.
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #14531: Running tests in parallel

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-803311241


   Oh could be yes.
   
   The other option we could do in the init scripts is to not call `docker login` but simply drop the .docker/config.json in place.
   
   An additional option: check that the login still exists in the clean script (this is set as an ExecPost run in the action.runner.service file from the userdata).
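   
   Something along these lines (just a sketch - the registry, user and token below are placeholders, not the values from our infra):
   
   ```bash
   # Write the Docker credentials file directly instead of calling "docker login".
   # REGISTRY, REGISTRY_USER and REGISTRY_TOKEN are placeholders.
   mkdir -p /home/runner/.docker
   auth="$(echo -n "${REGISTRY_USER}:${REGISTRY_TOKEN}" | base64 --wrap=0)"
   printf '{"auths": {"%s": {"auth": "%s"}}}\n' "${REGISTRY}" "${auth}" \
       > /home/runner/.docker/config.json
   chown -R runner:runner /home/runner/.docker
   ```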


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#discussion_r585992580



##########
File path: tests/www/test_views.py
##########
@@ -1154,6 +1154,11 @@ def test_page_instance_name_xss_prevention(self):
 
 
 class TestConfigurationView(TestBase):
+    def setUp(self):
+        super().setUp()
+        with mock.patch.dict(os.environ, {"AIRFLOW__CORE__UNIT_TEST_MODE": "False"}):
+            initialize_config()

Review comment:
       Is this needed?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-790050472


   > Not once in that comment you linked to did you say that you tested it on AWS.
   
   The whole comment was written after the PR failed. It is exclusively about the tests that are failing in this PR and about justifying why we need machines bigger than 32 GB - I even copy&pasted the output from the configuration printed by the failing tests and the machines in AWS there. And I explained that it takes 26 minutes to fail (in those very tests triggered by the PR).
   
   And THEN I explained why I believe 64 GB is needed (because the corresponding tests were taking 35 GB on my machine).
   
   Here is the beginning of the comment. All of it refers to tests that failed on our self-hosted architecture on a 32 GB machine. I re-read it again, and I think you cannot read it differently. I am just explaining why we need bigger machines for our tests (which are - obviously - running on our AWS infrastructure).
   
   > Hey @ashb - we need bigger machines as I suspected :) .
   
   > The good news is that it will be much cheaper in the long run as we will need them for far less time.
   
   > The tests are failing but mainly because of memory problems and timeouts (so I guess we are simply using too much of RAM , if we up the machine to 64 GB I think this should go rather smoothly. The good news is that even with not enough memory (and with failures/timeouts) the tests took ~26 m (!) for sqlite - rather than > 1 h, so when we have enough memory we can achieve the 15 minutes I was hoping for. Those 64 GB machines are only a bit more expensive than the 32 GB ones, so we will save a lot of credits when it works.
   
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] dimberman commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
dimberman commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-788073646


   @potiuk if we're running into this issue on self hosted, is there a chance this will cause problems on regular CI runs?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-796080875


   No 137s, but a few mysterious 'runner failed' errors without logs on the public runners - I have to take a closer look at those. A predictable running sequence, or maybe a special exception so that integration tests run independently, might help with that. Integration tests take much more memory than the others, but they are short, and I might want to make sure they are never run in parallel with other test types if we have little memory available.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-787510745


   > but I am not sure if the auto-scaling solution we have will handle two types
   
   Not yet, but it could. We'd need two ASGs, and the logic for which one to scale up would need encoding in the lambda.
   
   Some of the upload steps can be reduced by moving everything to us-east-* regions too - closer to Azure where GitHub runs, and closer to docker.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-789393377


   > If yes, then the Launch Template in eu-central-1 defines the instance size, and the user-data defines how big we mount the /var/lib/docker -- cloud-init.yaml lives in the airflow-ci-infra repo and (for now) needs to be manually updated in AWS on change. Command is in the commit that added it.
   
   Do I need to do anything - like restarting the machines etc. - to be sure that the new changes are in effect?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-787506971


   Hey @ashb - we need bigger machines as I suspected :) . 
   
   The good news is that it will be much cheaper in the long run as we will need them for far less time.
   
   The tests are failing, but mainly because of memory problems and timeouts (so I guess we are simply using too much RAM); if we up the machines to 64 GB I think this should go rather smoothly. The good news is that even with not enough memory (and with failures/timeouts) the tests took ~26 m for sqlite - rather than > 1 h - so when we have enough memory we can achieve the 15 minutes I was hoping for. Those 64 GB machines are only a bit more expensive than the 32 GB ones, so we will save a lot of credits when it works.
   
   We can even optimize it away a bit and have two self-hosted types:
   
   1) Big 64GB ones for the tests
   2) Smaller 32GB ones  for everything else
   
   It should be rather easy to configure in the CI.yml , but I am not sure if the auto-scaling solution we have will handle two types?
   
   Here is a job where we have partial successes/failures, and it shows how those tests will look - it is actually a good one to demonstrate it. You can see that the output is nicely grouped and you can see very clearly the monitoring and progress (it will be much nicer when we have more memory, because each test will progress much faster). Also, I print a summary of the failed tests at the end - only the "failure" outputs are fully printed to the logs at the end, in "Red" groups - which will make it far easier to analyze the problems (the same kind of output improvement is in the sequential version of the tests run on GitHub runners).
   
   In this case three test types succeeded (`Heisentests Core Providers`) and the remaining 5 had some failures (most of them, from what I see, due to timeouts, which is perfectly understandable if we ran out of memory and started to swap out to remote SSD in the cloud): https://github.com/apache/airflow/pull/14531/checks?check_run_id=1999175947.
   
   Rationale for bigger machines:
   
   From what I see we have machines with 32 GB, and since half of that will easily be eaten by `tmpfs` when we start writing logs and the like, we only have ~16 GB, which is not enough. During my tests (https://twitter.com/higrys/status/1366037359461101569/photo/1) all the tests running in parallel took ~35 GB of memory on my 64 GB machine (I had just a local SSD, not tmpfs, for that, but I do not think we need 35 GB of tmpfs for all logs, docker, tmp etc., and we can fine-tune that if we do).
   
   Also, it is more important than before to clean up the tmpfs volumes before each run and make them "pristine" for every run - because we will be using nearly all of the space. I think that will also help with cases like #14505 where some left-overs from previous runs are causing the jobs to fail.
   
   ```
                         total        used        free      shared  buff/cache   available
     Mem:           30Gi       696Mi        24Gi       3.4Gi       5.5Gi        26Gi
     Swap:            0B          0B          0B
   
     Filesystem      Size  Used Avail Use% Mounted on
     /dev/root       7.7G  2.6G  5.2G  34% /
     devtmpfs         16G     0   16G   0% /dev
     tmpfs            16G     0   16G   0% /dev/shm
     tmpfs           3.1G  804K  3.1G   1% /run
     tmpfs           5.0M     0  5.0M   0% /run/lock
     tmpfs            16G     0   16G   0% /sys/fs/cgroup
     tmpfs           3.1G  168K  3.1G   1% /tmp
     tmpfs            21G  2.9G   18G  15% /var/lib/docker
     tmpfs            16G  534M   15G   4% /home/runner/actions-runner/_work
     /dev/loop0       98M   98M     0 100% /snap/core/10185
     /dev/loop1       56M   56M     0 100% /snap/core18/1885
     /dev/loop2       71M   71M     0 100% /snap/lxd/16922
     /dev/loop3       29M   29M     0 100% /snap/amazon-ssm-agent/2012
   
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-792633011


   Everyone - please take a look and comment/approve.
   
   Temporarily I switched the instance template back to the `.xlarge` instance to save costs before this one gets merged. I will also (in the evening) look at the two tests that seem to be failing / flaky:
   
   * deadlock in backfill 
   * K8s test taking too long to launch POD
   
   Any help appreciated - if you have seen any of those before, I am happy to hear about it.
   
   Before merging this one I want to make sure it is rock-solid.
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#discussion_r589132393



##########
File path: .github/workflows/ci.yml
##########
@@ -35,6 +35,8 @@ env:
   DB_RESET: "true"
   VERBOSE: "true"
   DOCKER_CACHE: "pulled"
+  # Do not show all output lines on CI. Only output of the failed tests will be shown as a summary
+  TEST_SHOW_ALL_OUTPUT_LINES_AS_THEY_APPEAR: "false"

Review comment:
       Removed this variable altogether.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-789386354


   > Have you looked at pytest-xdist?
   
   As I already explained several times - yes, we have looked at this several times in the past, by different people - the first time 2 years ago - and we got scared by the results. The results were always the same - it failed big time. You can try it yourself if you want.
   
   Our tests (a big number of them) rely on a shared database which is reset, reinitialized, filled with data and cleaned by multiple tests. The shared database is a shared-state resource, and if we try to run the tests in parallel using xdist, they override each other's data and fail completely randomly. There is no way to run parallel tests sharing the same database (unless we completely redefine our tests and always mock the database, or somehow isolate all the tests that use the DB).
   
   In particular, quite a lot of tests from the core (including the scheduler ones - I guess you are familiar with those) actually use the DB and are not ready to be run in parallel. I think you can imagine best what starts happening if you run those multiple scheduler tests against the same database in parallel. But those are not the only ones. The Airflow test suite encourages people to use the DB during their testing.
   
   There are also a number of tests that rely on side effects from other tests (stored in the same database).
   
   My solution solves it in such a way that every test type has its own database (and within each group the tests are run sequentially). This way they cannot override each other's data, and since I already run each group separately in the 'sequential' approach, I know that the side effects are at least 'under control' (i.e. they do not show up).
   
   Of course, if we lived in an ideal world we would have no side effects and no shared database, in which case pytest-xdist would work for us. But unfortunately our small Airflow world is not perfect. And fixing it to be perfect and ideal would likely take weeks or months of work, isolating, fixing and mocking out all the tests. It's a great regret of mine, but I believe no one in the community has time for it, so I prefer to implement what can give an immediate effect and can be implemented with very small effort.
   
   But I do encourage you to take on the task and fix all the tests. That would be perfect to have at some point in time. Also, it would likely drive down the time needed to run the tests even more.
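   
   In a nutshell, the parallel run looks roughly like this (a simplified sketch of what the PR does in `scripts/ci/testing/ci_run_airflow_testing.sh` - the real script also monitors progress and prints the summaries):
   
   ```bash
   for test_type in ${TEST_TYPES}
   do
       # Each test type gets a free host port, so its database container does not
       # clash with the databases used by the other test types running in parallel
       database_port="$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1])')"
       TEST_TYPE="${test_type}" MYSQL_HOST_PORT="${database_port}" POSTGRES_HOST_PORT="${database_port}" \
           run_single_test_type "${@}" > "${CACHE_TMP_FILE_DIR}/output-${test_type}.log" 2>&1 &
   done
   # Within each test type the tests still run sequentially against their own database
   wait
   ```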


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-787506971


   Hey @ashb - we need bigger machines as I suspected :) . 
   
   The good news is that it will be much cheaper in the long run as we will need them for far less time.
   
   The tests are failing, but mainly because of memory problems and timeouts (so I guess we are simply using too much RAM); if we up the machines to 64 GB I think this should go rather smoothly. The good news is that even with not enough memory (and with failures/timeouts) the tests took ~26 m for `sqlite` - rather than > 1 h - so when we have enough memory we can achieve the 15 minutes I was hoping for. Those 64 GB machines are only a bit more expensive than the 32 GB ones, so we will save a lot of credits when it works.
   
   We can even optimize it away a bit and have two self-hosted types:
   
   1) Big 64 GB ones for the tests
   2) Smaller 32 GB ones  for everything else
   
   It should be rather easy to configure in the `CI.yml`, but I am not sure if the auto-scaling solution we have will handle two types?
   
   Here is a job where we have partial successes/failures, and it shows how those tests will look - it is actually a good one to demonstrate it. You can see that the output is nicely grouped and you can see very clearly the monitoring and progress (it will be much nicer when we have more memory, because each test will progress much faster). Also, I print a summary of the failed tests at the end - only the "failure" outputs are fully printed to the logs at the end, in "Red" groups - which will make it far easier to analyze the problems (the same kind of output improvement is in the sequential version of the tests run on GitHub runners).
   
   In this case three test types succeeded (`Heisentests Core Providers`) and the remaining 5 had some failures (most of them, from what I see, due to timeouts, which is perfectly understandable if we ran out of memory and started to swap out to remote SSD in the cloud): https://github.com/apache/airflow/pull/14531/checks?check_run_id=1999175947.
   
   Rationale for bigger machines:
   
   From what I see we have machines with 32 GB, and since half of that will easily be eaten by `tmpfs` when we start writing logs and the like, we only have ~16 GB, which is not enough. During my tests (https://twitter.com/higrys/status/1366037359461101569/photo/1) all the tests running in parallel took ~35 GB of memory on my 64 GB machine. I had just a local SSD, not `tmpfs`, for those tests, but I do not think we need 30 GB of tmpfs for all logs, docker, tmp etc. (and we can fine-tune that if we do).
   
   Also, it is more important than before to clean up the `tmpfs` volumes before each run and make them "pristine" for every run - because we will be using nearly all of the space. I think that will also help with cases like #14505 where some left-overs from previous runs are causing the jobs to fail.
   
   This is the machine state before the tests are run:
   
   ```
                         total        used        free      shared  buff/cache   available
     Mem:           30Gi       696Mi        24Gi       3.4Gi       5.5Gi        26Gi
     Swap:            0B          0B          0B
   
     Filesystem      Size  Used Avail Use% Mounted on
     /dev/root       7.7G  2.6G  5.2G  34% /
     devtmpfs         16G     0   16G   0% /dev
     tmpfs            16G     0   16G   0% /dev/shm
     tmpfs           3.1G  804K  3.1G   1% /run
     tmpfs           5.0M     0  5.0M   0% /run/lock
     tmpfs            16G     0   16G   0% /sys/fs/cgroup
     tmpfs           3.1G  168K  3.1G   1% /tmp
     tmpfs            21G  2.9G   18G  15% /var/lib/docker
     tmpfs            16G  534M   15G   4% /home/runner/actions-runner/_work
     /dev/loop0       98M   98M     0 100% /snap/core/10185
     /dev/loop1       56M   56M     0 100% /snap/core18/1885
     /dev/loop2       71M   71M     0 100% /snap/lxd/16922
     /dev/loop3       29M   29M     0 100% /snap/amazon-ssm-agent/2012
   
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-789386354


   > Have you looked at pytest-xdist?
   
   As I already explained several times - yes, we have looked at this several times in the past, by different people - the first time 2 years ago - and we got scared by the results. The results were always the same - it failed big time. You can try it yourself if you want.
   
   Our tests (a big number of them) rely on a shared database which is reset, reinitialized, filled with data and cleaned by multiple tests. The shared database is a shared-state resource, and if we try to run the tests in parallel using xdist, they override each other's data and fail completely randomly. There is no way to run parallel tests sharing the same database (unless we completely redefine our tests and always mock the database, or somehow isolate all the tests that use the DB).
   
   In particular, quite a lot of tests from the core (including the scheduler ones - I guess you are familiar with those) actually use the DB and are not ready to be run in parallel. I think you can imagine best what starts happening if you run those multiple scheduler tests against the same database in parallel. But those are not the only ones. The Airflow test suite encourages people to use the DB during their testing.
   
   There are also a number of tests that rely on side effects from other tests (stored in the same database).
   
   My solution solves it in such a way that every test type has its own database (and within each group the tests are run sequentially). Then the groups are run in parallel. This way they cannot override each other's data, and since I already run each group separately in the 'sequential' approach, I know that the side effects are at least 'under control' (i.e. they do not show up).
   
   Of course, if we lived in an ideal world we would have no side effects and no shared database, in which case pytest-xdist would work for us. But unfortunately our small Airflow world is not perfect. And fixing it to be perfect and ideal would likely take weeks or months of work, isolating, fixing and mocking out all the tests. It's a great regret of mine, but I believe no one in the community has time for it, so I prefer to implement what can give an immediate effect and can be implemented with very small effort.
   
   But I do encourage you to take on the task and fix all the tests. That would be perfect to have at some point in time. Also, it would likely drive down the time needed to run the tests even more.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on a change in pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
ashb commented on a change in pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#discussion_r586410651



##########
File path: scripts/ci/testing/ci_run_airflow_testing.sh
##########
@@ -170,5 +161,282 @@ do
     echo "**********************************************************************************************"
 
     run_airflow_testing_in_docker "${@}"
+}
+
+
+# Cleans up Docker before test execution. System prune should clean all the temporary/unnamed images
+# And left-over volumes
+function cleanup_docker() {
+    echo
+    echo "System-prune docker"
+    echo
+    docker system prune --force --volumes
+    echo
+    echo "Check available space"
+    echo
+    df --human
+    echo
+    echo "Check available memory"
+    echo
+    free --human
+}
+
+# Prints status of a parallel test type. Returns 0 if the test type is still running, 1 otherwise
+#
+# $1 - Test type
+#
+# The test_pids variable should contain mapping of test types to pids
+# NOTE: It might be a surprise that local variables (like test_pids) are visible to called functions
+#       But this is how local variables in bash work (by design).
+function print_test_type_status() {
+    local test_type=$1
+    local pid="${test_pids[${test_type}]}"
+    local log_file="${CACHE_TMP_FILE_DIR}/output-${test_type}.log"
+    if ps -p "${pid}" >/dev/null 2>/dev/null; then
+        if [[ ${TEST_SHOW_ALL_OUTPUT_LINES_AS_THEY_APPEAR} != "true" ]]; then
+            echo "${COLOR_BLUE}Test type ${test_type} in progress (last ${TEST_SHOW_OUTPUT_LINES} lines of the output shown).${COLOR_RESET}"
+            echo
+            tail -${TEST_SHOW_OUTPUT_LINES} "${log_file}"
+            echo
+            echo
+        fi
+        return 0
+    else
+        if wait "${pid}"; then
+            echo
+            echo "${COLOR_GREEN}Test type: ${test_type} succeeded.${COLOR_RESET}"
+            echo
+            return 1
+        else
+            echo
+            echo "${COLOR_RED}Test type: ${test_type} failed.${COLOR_RESET}"
+            echo
+            return 1
+        fi
+    fi
+}
+
+# Prints status of all parallel test types. Returns 0 if any of the tests are still running, 1 otherwise
+function print_all_test_types_status() {
+    local still_running="false"
+    local test_type
+    echo "${COLOR_YELLOW}Progress for all tests: ${COLOR_BLUE}${TEST_TYPES}${COLOR_RESET}"
+    echo
+    for test_type in ${TEST_TYPES}
+    do
+        if print_test_type_status "${test_type}" ; then
+            still_running="true"
+        fi
+    done
+    if [[ ${still_running} == "true" ]]; then
+        return 0
+    fi
+    return 1
+}
+
+# Starts all test types in parallel
+# TEST_TYPES - list of all test types (it's not an array, it is space-separate list)
+# ${@} - additional arguments to pass to test execution
+function start_test_types_in_parallel() {
+    local test_type
+    local database_port
+    start_end::group_start "Starting parallel tests: ${TEST_TYPES}"
+    for test_type in ${TEST_TYPES}
+    do
+        # Grabbing a free random port to use for database forwarding
+        database_port="$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1])')";
+        echo
+        echo "Grabbed database port for forwarding: ${database_port}"
+        log_file="${CACHE_TMP_FILE_DIR}/output-${test_type}.log"
+        TEST_TYPE=${test_type} MYSQL_HOST_PORT="${database_port}" POSTGRES_HOST_PORT="${database_port}" \
+            run_single_test_type "${@}" >"${log_file}" 2>&1 &

Review comment:
       Nice, I didn't know you could "background" a function call like this.
   
   Curiously it appears that `$$` is "wrong" in such a case. (Doesn't affect us, I was just testing this out):
   
   ```bash
   function run_it {
     echo $$ sub-started
     sleep 10
     echo $$ finished
   } 
   
   echo $$ started
   run_it &
   pid=$!
   echo waiting for $pid
   wait $pid
   ```
   
   Outputs this:
   ```
   410386 started
   waiting for 410387
   410386 sub-started
   410386 finished
   ```
   
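   Side note (not part of the change): `$$` always expands to the PID of the original shell, even inside a backgrounded subshell, while `$BASHPID` reflects the subshell actually running the function - which would explain the numbers above. A minimal sketch:
   
   ```bash
   function run_it {
     # $$ keeps the parent's PID; $BASHPID is the PID of the backgrounded subshell.
     echo "run_it: \$\$=$$ BASHPID=$BASHPID"
     sleep 1
   }
   
   echo "main:   \$\$=$$ BASHPID=$BASHPID"
   run_it &
   wait $!
   ```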

##########
File path: scripts/ci/testing/ci_run_airflow_testing.sh
##########
@@ -170,5 +161,282 @@ do
     echo "**********************************************************************************************"
 
     run_airflow_testing_in_docker "${@}"
+}
+
+
+# Cleans up Docker before test execution. System prune should clean all the temporary/unnamed images
+# And left-over volumes
+function cleanup_docker() {
+    echo
+    echo "System-prune docker"
+    echo
+    docker system prune --force --volumes
+    echo
+    echo "Check available space"
+    echo
+    df --human
+    echo
+    echo "Check available memory"
+    echo
+    free --human
+}
+
+# Prints status of a parallel test type. Returns 0 if the test type is still running, 1 otherwise
+#
+# $1 - Test type
+#
+# The test_pids variable should contain mapping of test types to pids
+# NOTE: It might be a surprise that local variables (like test_pids) are visible to called functions,
+#       but this is how local variables in bash work (by design).
+function print_test_type_status() {
+    local test_type=$1
+    local pid="${test_pids[${test_type}]}"
+    local log_file="${CACHE_TMP_FILE_DIR}/output-${test_type}.log"
+    if ps -p "${pid}" >/dev/null 2>/dev/null; then
+        if [[ ${TEST_SHOW_ALL_OUTPUT_LINES_AS_THEY_APPEAR} != "true" ]]; then
+            echo "${COLOR_BLUE}Test type ${test_type} in progress (last ${TEST_SHOW_OUTPUT_LINES} lines of the output shown).${COLOR_RESET}"
+            echo
+            tail -${TEST_SHOW_OUTPUT_LINES} "${log_file}"
+            echo
+            echo
+        fi
+        return 0
+    else
+        if wait "${pid}"; then
+            echo
+            echo "${COLOR_GREEN}Test type: ${test_type} succeeded.${COLOR_RESET}"
+            echo
+            return 1
+        else
+            echo
+            echo "${COLOR_RED}Test type: ${test_type} failed.${COLOR_RESET}"
+            echo
+            return 1
+        fi
+    fi
+}
+
+# Prints status of all parallel test types. Returns 0 if any of the tests are still running, 1 otherwise
+function print_all_test_types_status() {
+    local still_running="false"
+    local test_type
+    echo "${COLOR_YELLOW}Progress for all tests: ${COLOR_BLUE}${TEST_TYPES}${COLOR_RESET}"
+    echo
+    for test_type in ${TEST_TYPES}
+    do
+        if print_test_type_status "${test_type}" ; then
+            still_running="true"
+        fi
+    done
+    if [[ ${still_running} == "true" ]]; then
+        return 0
+    fi
+    return 1
+}
+
+# Starts all test types in parallel
+# TEST_TYPES - list of all test types (it's not an array, it is a space-separated list)
+# ${@} - additional arguments to pass to test execution
+function start_test_types_in_parallel() {
+    local test_type
+    local database_port
+    start_end::group_start "Starting parallel tests: ${TEST_TYPES}"
+    for test_type in ${TEST_TYPES}
+    do
+        # Grabbing a free random port to use for database forwarding
+        database_port="$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1])')";
+        echo
+        echo "Grabbed database port for forwarding: ${database_port}"
+        log_file="${CACHE_TMP_FILE_DIR}/output-${test_type}.log"
+        TEST_TYPE=${test_type} MYSQL_HOST_PORT="${database_port}" POSTGRES_HOST_PORT="${database_port}" \
+            run_single_test_type "${@}" >"${log_file}" 2>&1 &
+        pid=$!
+        echo "The process started for ${test_type}. Log file ${log_file}"
+        echo
+        test_pids[${test_type}]=${pid}
+    done
+    start_end::group_end
+}
+
+# Monitors progress of all test types running
+# TEST_TYPES - all test types to monitor
+# test_pids - map of test types to PIDs
+function monitor_test_types() {
+    start_end::group_start "Monitoring progress of running parallel tests: ${TEST_TYPES}"
+    while true;
+    do
+        sleep "${TEST_SLEEP_TIME}"
+        if ! print_all_test_types_status; then
+            break
+        fi
+    done
+    start_end::group_end
+}
+
+
+# Monitors progress of single test type running
+# $1 - test type to monitor
+# test_pids - map of test types to PIDs
+function monitor_sequential_test_type() {
+    local test_type=$1
+    start_end::group_start "Monitoring progress of running sequential test type: $1"
+    while true;
+    do
+        sleep "${TEST_SLEEP_TIME}"
+        if ! print_test_type_status "${test_type}"; then
+            break
+        fi
+    done
+    start_end::group_end
+}
+
+# Checks status for all test types running. Returns a non-zero code if any of the tests failed (with the
+# exception of Quarantined tests).
+#
+# TEST_TYPES - all test types to run
+#    test_pids - map of test types to PIDs
+#
+# output:
+#   successful_tests - array of test types that succeeded
+#   failed_tests - array of test types that failed
+function check_test_types_status() {
+    start_end::group_start "Checking status for parallel tests: ${TEST_TYPES}"
+    local test_type_exit_code="0"
+    for test_type in "${!test_pids[@]}"; do
+        local pid="${test_pids[${test_type}]}"
+        wait "${pid}"
+        local temp_exit_code=$?
+        if [[ ${temp_exit_code} == "0" ]]; then
+            successful_tests+=("${test_type}")
+        else
+            failed_tests+=("${test_type}")
+            if [[ ${test_type} != "Quarantined" ]]; then
+                test_type_exit_code=${temp_exit_code}
+            fi
+        fi
+    done
+    start_end::group_end
+    return "${test_type_exit_code}"
+}
+
+# Outputs logs for failed test type
+# $1 test type
+function output_log_for_failed_test_type(){
+    local failed_test_type=$1
+    local log_file="${CACHE_TMP_FILE_DIR}/output-${failed_test_type}.log"
+    start_end::group_start "${COLOR_RED}Output for failed ${failed_test_type} from log: ${log_file}${COLOR_RESET}"
+    echo "${COLOR_RED}##### Test type ${failed_test_type} failed ##### ${COLOR_RESET}"
+    echo
+    cat "${log_file}"
+    echo
+    echo "${COLOR_RED}##### Test type ${failed_test_type} failed ##### ${COLOR_RESET}"
+    echo
+    start_end::group_end
+}
+
+function print_test_summary() {
+    if [[ ${#successful_tests[@]} != 0 ]]; then
+        echo
+        echo "${COLOR_GREEN}Successful test types: ${successful_tests[*]} ${COLOR_RESET}"
+        echo
+    fi
+    if [[ ${#failed_tests[@]} != 0 ]]; then
+        echo
+        echo "${COLOR_RED}Failed test types: ${failed_tests[*]}${COLOR_RESET}"
+        echo
+        for test_type in "${failed_tests[@]}"
+        do
+            output_log_for_failed_test_type "${test_type}"
+        done
+    fi
+}
+
+# Runs all test types in parallel, monitors their progress and summarizes the result when finished.
+function run_all_test_types_in_parallel() {
+    local successful_tests=()
+    local failed_tests=()
+    local exit_code="0"
+    local log_file
+    local test_type
+    local test_pids
+    declare -A test_pids
+
+    start_end::group_start "Cleanup docker before parallel tests: ${TEST_TYPES}."
+    cleanup_docker
     start_end::group_end
-done
+
+    start_test_types_in_parallel "${@}"
+
+    set +e
+    monitor_test_types
+
+    check_test_types_status
+    exit_code=$?
+    set -e
+    print_test_summary
+    return ${exit_code}
+}
+
+function run_all_test_types_sequentially() {
+    local exit_code="0"
+    local successful_tests=()
+    local failed_tests=()
+    local log_file
+    local test_type
+    local pid
+    local test_pids
+    declare -A test_pids

Review comment:
       ```suggestion
       local -A test_pids
   ```
   (or `declare -A test_pids`) We don't need both:
   
   > When used in a function, declare and typeset make each name local, as with the local command, unless the -g option is supplied.
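   
   A tiny standalone sketch (just an illustration, not from the PR) of both behaviours - `declare -A` inside a function is already local unless `-g` is given, and a called function still sees the caller's local map thanks to bash's dynamic scoping:
   
   ```bash
   function callee() {
       # Sees the caller's local associative array via dynamic scoping.
       echo "callee sees: ${pids[Core]}"
   }
   
   function caller() {
       declare -A pids            # local to caller() because -g was not used
       pids[Core]=12345
       callee
   }
   
   caller                                  # -> callee sees: 12345
   echo "outside: ${pids[Core]:-unset}"    # -> outside: unset
   ```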

##########
File path: .github/workflows/ci.yml
##########
@@ -35,6 +35,8 @@ env:
   DB_RESET: "true"
   VERBOSE: "true"
   DOCKER_CACHE: "pulled"
+  # Do not show all output lines on CI. Only output of the failed tests will be shown as a summary
+  TEST_SHOW_ALL_OUTPUT_LINES_AS_THEY_APPEAR: "false"

Review comment:
       Do we want this configurable at all? If we are running a single test (i.e. in sequential mode) then we still want output to show up, and if we are running parallel tests then we can't have this and force it off anyway.

##########
File path: scripts/ci/testing/ci_run_airflow_testing.sh
##########
@@ -170,5 +161,282 @@ do
     echo "**********************************************************************************************"
 
     run_airflow_testing_in_docker "${@}"
+}
+
+
+# Cleans up Docker before test execution. System prune should clean all the temporary/unnamed images
+# And left-over volumes
+function cleanup_docker() {
+    echo
+    echo "System-prune docker"
+    echo
+    docker system prune --force --volumes
+    echo
+    echo "Check available space"
+    echo
+    df --human
+    echo
+    echo "Check available memory"
+    echo
+    free --human
+}
+
+# Prints status of a parallel test type. Returns 0 if the test type is still running, 1 otherwise
+#
+# $1 - Test type
+#
+# The test_pids variable should contain mapping of test types to pids
+# NOTE: It might be a surprise that local variables (like test_pids) are visible to called functions,
+#       but this is how local variables in bash work (by design).
+function print_test_type_status() {
+    local test_type=$1
+    local pid="${test_pids[${test_type}]}"
+    local log_file="${CACHE_TMP_FILE_DIR}/output-${test_type}.log"
+    if ps -p "${pid}" >/dev/null 2>/dev/null; then
+        if [[ ${TEST_SHOW_ALL_OUTPUT_LINES_AS_THEY_APPEAR} != "true" ]]; then
+            echo "${COLOR_BLUE}Test type ${test_type} in progress (last ${TEST_SHOW_OUTPUT_LINES} lines of the output shown).${COLOR_RESET}"
+            echo
+            tail -${TEST_SHOW_OUTPUT_LINES} "${log_file}"
+            echo
+            echo
+        fi
+        return 0
+    else
+        if wait "${pid}"; then
+            echo
+            echo "${COLOR_GREEN}Test type: ${test_type} succeeded.${COLOR_RESET}"
+            echo
+            return 1
+        else
+            echo
+            echo "${COLOR_RED}Test type: ${test_type} failed.${COLOR_RESET}"
+            echo
+            return 1
+        fi
+    fi
+}
+
+# Prints status of all parallel test types. Returns 0 if any of the tests are still running, 1 otherwise
+function print_all_test_types_status() {
+    local still_running="false"
+    local test_type
+    echo "${COLOR_YELLOW}Progress for all tests: ${COLOR_BLUE}${TEST_TYPES}${COLOR_RESET}"
+    echo
+    for test_type in ${TEST_TYPES}
+    do
+        if print_test_type_status "${test_type}" ; then
+            still_running="true"
+        fi
+    done
+    if [[ ${still_running} == "true" ]]; then
+        return 0
+    fi
+    return 1
+}
+
+# Starts all test types in parallel
+# TEST_TYPES - list of all test types (it's not an array, it is a space-separated list)
+# ${@} - additional arguments to pass to test execution
+function start_test_types_in_parallel() {
+    local test_type
+    local database_port
+    start_end::group_start "Starting parallel tests: ${TEST_TYPES}"
+    for test_type in ${TEST_TYPES}
+    do
+        # Grabbing a free random port to use for database forwarding
+        database_port="$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1])')";
+        echo
+        echo "Grabbed database port for forwarding: ${database_port}"
+        log_file="${CACHE_TMP_FILE_DIR}/output-${test_type}.log"
+        TEST_TYPE=${test_type} MYSQL_HOST_PORT="${database_port}" POSTGRES_HOST_PORT="${database_port}" \
+            run_single_test_type "${@}" >"${log_file}" 2>&1 &
+        pid=$!
+        echo "The process started for ${test_type}. Log file ${log_file}"
+        echo
+        test_pids[${test_type}]=${pid}
+    done
+    start_end::group_end
+}
+
+# Monitors progress of all test types running
+# TEST_TYPES - all test types to monitor
+# test_pids - map of test types to PIDs
+function monitor_test_types() {
+    start_end::group_start "Monitoring progress of running parallel tests: ${TEST_TYPES}"
+    while true;
+    do
+        sleep "${TEST_SLEEP_TIME}"
+        if ! print_all_test_types_status; then
+            break
+        fi
+    done
+    start_end::group_end
+}
+
+
+# Monitors progress of single test type running
+# $1 - test type to monitor
+# test_pids - map of test types to PIDs
+function monitor_sequential_test_type() {
+    local test_type=$1
+    start_end::group_start "Monitoring progress of running sequential test type: $1"
+    while true;
+    do
+        sleep "${TEST_SLEEP_TIME}"
+        if ! print_test_type_status "${test_type}"; then
+            break
+        fi
+    done
+    start_end::group_end
+}
+
+# Checks status for all test types running. Returns a non-zero code if any of the tests failed (with the
+# exception of Quarantined tests).
+#
+# TEST_TYPES - all test types to run
+#    test_pids - map of test types to PIDs
+#
+# output:
+#   successful_tests - array of test types that succeeded
+#   failed_tests - array of test types that failed
+function check_test_types_status() {
+    start_end::group_start "Checking status for parallel tests: ${TEST_TYPES}"
+    local test_type_exit_code="0"
+    for test_type in "${!test_pids[@]}"; do
+        local pid="${test_pids[${test_type}]}"
+        wait "${pid}"
+        local temp_exit_code=$?
+        if [[ ${temp_exit_code} == "0" ]]; then
+            successful_tests+=("${test_type}")
+        else
+            failed_tests+=("${test_type}")
+            if [[ ${test_type} != "Quarantined" ]]; then
+                test_type_exit_code=${temp_exit_code}
+            fi
+        fi
+    done
+    start_end::group_end
+    return "${test_type_exit_code}"
+}
+
+# Outputs logs for failed test type
+# $1 test type
+function output_log_for_failed_test_type(){
+    local failed_test_type=$1
+    local log_file="${CACHE_TMP_FILE_DIR}/output-${failed_test_type}.log"
+    start_end::group_start "${COLOR_RED}Output for failed ${failed_test_type} from log: ${log_file}${COLOR_RESET}"
+    echo "${COLOR_RED}##### Test type ${failed_test_type} failed ##### ${COLOR_RESET}"
+    echo
+    cat "${log_file}"
+    echo
+    echo "${COLOR_RED}##### Test type ${failed_test_type} failed ##### ${COLOR_RESET}"
+    echo
+    start_end::group_end
+}
+
+function print_test_summary() {
+    if [[ ${#successful_tests[@]} != 0 ]]; then
+        echo
+        echo "${COLOR_GREEN}Successful test types: ${successful_tests[*]} ${COLOR_RESET}"
+        echo
+    fi
+    if [[ ${#failed_tests[@]} != 0 ]]; then
+        echo
+        echo "${COLOR_RED}Failed test types: ${failed_tests[*]}${COLOR_RESET}"
+        echo
+        for test_type in "${failed_tests[@]}"
+        do
+            output_log_for_failed_test_type "${test_type}"
+        done
+    fi
+}
+
+# Runs all test types in parallel, monitors their progress and summarizes the result when finished.
+function run_all_test_types_in_parallel() {
+    local successful_tests=()
+    local failed_tests=()
+    local exit_code="0"
+    local log_file
+    local test_type
+    local test_pids
+    declare -A test_pids
+
+    start_end::group_start "Cleanup docker before parallel tests: ${TEST_TYPES}."
+    cleanup_docker
     start_end::group_end
-done
+
+    start_test_types_in_parallel "${@}"
+
+    set +e
+    monitor_test_types
+
+    check_test_types_status
+    exit_code=$?
+    set -e
+    print_test_summary
+    return ${exit_code}
+}
+
+function run_all_test_types_sequentially() {
+    local exit_code="0"
+    local successful_tests=()
+    local failed_tests=()
+    local log_file
+    local test_type
+    local pid
+    local test_pids
+    declare -A test_pids
+    for test_type in ${TEST_TYPES}
+    do
+        start_end::group_start "Cleanup docker before ${test_type} test"
+        cleanup_docker
+        start_end::group_end
+        start_end::group_start "Running tests ${test_type}"
+        log_file="${CACHE_TMP_FILE_DIR}/output-${test_type}.log"
+        TEST_TYPE=${test_type} run_single_test_type "${@}"  >"${log_file}" 2>&1 &
+        pid=$!
+        echo "The process started for ${test_type}. Log file ${log_file}"
+        test_pids[${test_type}]=${pid}
+        if [[ ${TEST_SHOW_ALL_OUTPUT_LINES_AS_THEY_APPEAR} == "true" ]]; then
+            # If all output is to be shown, start a tail process to show the logs
+            tail -f "${log_file}" &
+        fi
+        start_end::group_end
+        set +e
+        monitor_sequential_test_type "${test_type}"
+        wait "${pid}"
+        local temp_exit_code=$?
+        set -e
+        if [[ ${temp_exit_code} == "0" ]]; then
+            successful_tests+=("${test_type}")
+        else
+            if [[ ${test_type} != "Quarantined" ]]; then
+                exit_code=${temp_exit_code}
+            fi
+            failed_tests+=("${test_type}")
+        fi
+    done
+    print_test_summary
+    return ${exit_code}
+}
+
+build_images::prepare_ci_build
+
+build_images::rebuild_ci_image_if_needed_with_group
+
+prepare_tests_to_run
+
+RUN_TESTS="true"
+export RUN_TESTS
+
+if [[ ${RUN_TESTS_IN_PARALLEL} == "true" ]] ; then
+    if [[ ${TEST_SHOW_ALL_OUTPUT_LINES_AS_THEY_APPEAR} == "true" ]]; then
+        echo
+        echo "In case of parallel tests we cannot mis those output lines as they appear. Disabling it!"

Review comment:
       ```suggestion
           echo "In case of parallel tests we cannot show output lines as they appear. Disabling it!"
   ```

##########
File path: tests/www/test_views.py
##########
@@ -1154,6 +1154,11 @@ def test_page_instance_name_xss_prevention(self):
 
 
 class TestConfigurationView(TestBase):
+    def setUp(self):
+        super().setUp()
+        with mock.patch.dict(os.environ, {"AIRFLOW__CORE__UNIT_TEST_MODE": "False"}):
+            initialize_config()

Review comment:
       WWW tests right now are run in full isolation, and if I start a fresh breeze container on master and run `pytest test/www/` it passes, so I don't quite understand what you mean.

##########
File path: scripts/ci/testing/ci_run_airflow_testing.sh
##########
@@ -170,5 +161,282 @@ do
     echo "**********************************************************************************************"
 
     run_airflow_testing_in_docker "${@}"
+}
+
+
+# Cleans up Docker before test execution. System prune should clean all the temporary/unnamed images
+# And left-over volumes
+function cleanup_docker() {
+    echo
+    echo "System-prune docker"
+    echo
+    docker system prune --force --volumes
+    echo
+    echo "Check available space"
+    echo
+    df --human
+    echo
+    echo "Check available memory"
+    echo
+    free --human
+}
+
+# Prints status of a parallel test type. Returns 0 if the test type is still running, 1 otherwise
+#
+# $1 - Test type
+#
+# The test_pids variable should contain mapping of test types to pids
+# NOTE: It might be a surprise that local variables (like test_pids) are visible to called functions,
+#       but this is how local variables in bash work (by design).
+function print_test_type_status() {
+    local test_type=$1
+    local pid="${test_pids[${test_type}]}"
+    local log_file="${CACHE_TMP_FILE_DIR}/output-${test_type}.log"
+    if ps -p "${pid}" >/dev/null 2>/dev/null; then
+        if [[ ${TEST_SHOW_ALL_OUTPUT_LINES_AS_THEY_APPEAR} != "true" ]]; then
+            echo "${COLOR_BLUE}Test type ${test_type} in progress (last ${TEST_SHOW_OUTPUT_LINES} lines of the output shown).${COLOR_RESET}"
+            echo
+            tail -${TEST_SHOW_OUTPUT_LINES} "${log_file}"
+            echo
+            echo
+        fi
+        return 0
+    else
+        if wait "${pid}"; then
+            echo
+            echo "${COLOR_GREEN}Test type: ${test_type} succeeded.${COLOR_RESET}"
+            echo
+            return 1
+        else
+            echo
+            echo "${COLOR_RED}Test type: ${test_type} failed.${COLOR_RESET}"
+            echo
+            return 1
+        fi
+    fi
+}
+
+# Prints status of all parallel test types. Returns 0 if any of the tests are still running, 1 otherwise
+function print_all_test_types_status() {
+    local still_running="false"
+    local test_type
+    echo "${COLOR_YELLOW}Progress for all tests: ${COLOR_BLUE}${TEST_TYPES}${COLOR_RESET}"
+    echo
+    for test_type in ${TEST_TYPES}
+    do
+        if print_test_type_status "${test_type}" ; then
+            still_running="true"
+        fi
+    done
+    if [[ ${still_running} == "true" ]]; then
+        return 0
+    fi
+    return 1
+}
+
+# Starts all test types in parallel
+# TEST_TYPES - list of all test types (it's not an array, it is a space-separated list)
+# ${@} - additional arguments to pass to test execution
+function start_test_types_in_parallel() {
+    local test_type
+    local database_port
+    start_end::group_start "Starting parallel tests: ${TEST_TYPES}"
+    for test_type in ${TEST_TYPES}
+    do
+        # Grabbing a free random port to use for database forwarding
+        database_port="$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1])')";
+        echo
+        echo "Grabbed database port for forwarding: ${database_port}"
+        log_file="${CACHE_TMP_FILE_DIR}/output-${test_type}.log"
+        TEST_TYPE=${test_type} MYSQL_HOST_PORT="${database_port}" POSTGRES_HOST_PORT="${database_port}" \
+            run_single_test_type "${@}" >"${log_file}" 2>&1 &
+        pid=$!
+        echo "The process started for ${test_type}. Log file ${log_file}"
+        echo
+        test_pids[${test_type}]=${pid}
+    done
+    start_end::group_end
+}
+
+# Monitors progress of all test types running
+# TEST_TYPES - all test types to monitor
+# test_pids - map of test types to PIDs
+function monitor_test_types() {
+    start_end::group_start "Monitoring progress of running parallel tests: ${TEST_TYPES}"
+    while true;
+    do
+        sleep "${TEST_SLEEP_TIME}"
+        if ! print_all_test_types_status; then
+            break
+        fi
+    done
+    start_end::group_end
+}
+
+
+# Monitors progress of single test type running
+# $1 - test type to monitor
+# test_pids - map of test types to PIDs
+function monitor_sequential_test_type() {
+    local test_type=$1
+    start_end::group_start "Monitoring progress of running sequential test type: $1"
+    while true;
+    do
+        sleep "${TEST_SLEEP_TIME}"
+        if ! print_test_type_status "${test_type}"; then
+            break
+        fi
+    done
+    start_end::group_end
+}
+
+# Checks status for all test types running. Returns a non-zero code if any of the tests failed (with the
+# exception of Quarantined tests).
+#
+# TEST_TYPES - all test types to run
+#    test_pids - map of test types to PIDs
+#
+# output:
+#   successful_tests - array of test types that succeeded
+#   failed_tests - array of test types that failed
+function check_test_types_status() {
+    start_end::group_start "Checking status for parallel tests: ${TEST_TYPES}"
+    local test_type_exit_code="0"
+    for test_type in "${!test_pids[@]}"; do
+        local pid="${test_pids[${test_type}]}"
+        wait "${pid}"
+        local temp_exit_code=$?
+        if [[ ${temp_exit_code} == "0" ]]; then
+            successful_tests+=("${test_type}")
+        else
+            failed_tests+=("${test_type}")
+            if [[ ${test_type} != "Quarantined" ]]; then
+                test_type_exit_code=${temp_exit_code}
+            fi
+        fi
+    done
+    start_end::group_end
+    return "${test_type_exit_code}"
+}
+
+# Outputs logs for failed test type
+# $1 test type
+function output_log_for_failed_test_type(){
+    local failed_test_type=$1
+    local log_file="${CACHE_TMP_FILE_DIR}/output-${failed_test_type}.log"
+    start_end::group_start "${COLOR_RED}Output for failed ${failed_test_type} from log: ${log_file}${COLOR_RESET}"
+    echo "${COLOR_RED}##### Test type ${failed_test_type} failed ##### ${COLOR_RESET}"
+    echo
+    cat "${log_file}"
+    echo
+    echo "${COLOR_RED}##### Test type ${failed_test_type} failed ##### ${COLOR_RESET}"
+    echo
+    start_end::group_end
+}
+
+function print_test_summary() {
+    if [[ ${#successful_tests[@]} != 0 ]]; then
+        echo
+        echo "${COLOR_GREEN}Successful test types: ${successful_tests[*]} ${COLOR_RESET}"
+        echo

Review comment:
       I think we should have the output logs shown even in case of success, at least on CI.
   
   The reason I'd like this is I often use information about the timing, both the total run time, and the output from `--durations` to see when things have gotten slow/where possible cases for speeding up might be.
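   
   For reference, the timing output mentioned here is pytest's standard slowest-durations report, e.g.:
   
   ```bash
   # Print the 25 slowest test durations after the run (plain pytest flag, not something added by this PR).
   pytest --durations=25 tests/www/
   ```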




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-792425028


   BTW. The one flaky test that fails is a deadlock on "backfill_depends_on_past" - it occurs on both Postgres and MySQL, and often enough to suggest that this is a real issue, but I am not sure how to fix it quickly.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #14531: Running tests in parallel

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-799910644


   [The Workflow run](https://github.com/apache/airflow/actions/runs/656151895) is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#discussion_r586588251



##########
File path: scripts/ci/testing/ci_run_airflow_testing.sh
##########
@@ -170,5 +161,282 @@ do
     echo "**********************************************************************************************"
 
     run_airflow_testing_in_docker "${@}"
+}
+
+
+# Cleans up Docker before test execution. System prune should clean all the temporary/unnamed images
+# And left-over volumes
+function cleanup_docker() {
+    echo
+    echo "System-prune docker"
+    echo
+    docker system prune --force --volumes
+    echo
+    echo "Check available space"
+    echo
+    df --human
+    echo
+    echo "Check available memory"
+    echo
+    free --human
+}
+
+# Prints status of a parallel test type. Returns 0 if the test type is still running, 1 otherwise
+#
+# $1 - Test type
+#
+# The test_pids variable should contain mapping of test types to pids
+# NOTE: It might be a surprise that local variables (like test_pids) are visible to called functions,
+#       but this is how local variables in bash work (by design).
+function print_test_type_status() {
+    local test_type=$1
+    local pid="${test_pids[${test_type}]}"
+    local log_file="${CACHE_TMP_FILE_DIR}/output-${test_type}.log"
+    if ps -p "${pid}" >/dev/null 2>/dev/null; then
+        if [[ ${TEST_SHOW_ALL_OUTPUT_LINES_AS_THEY_APPEAR} != "true" ]]; then
+            echo "${COLOR_BLUE}Test type ${test_type} in progress (last ${TEST_SHOW_OUTPUT_LINES} lines of the output shown).${COLOR_RESET}"
+            echo
+            tail -${TEST_SHOW_OUTPUT_LINES} "${log_file}"
+            echo
+            echo
+        fi
+        return 0
+    else
+        if wait "${pid}"; then
+            echo
+            echo "${COLOR_GREEN}Test type: ${test_type} succeeded.${COLOR_RESET}"
+            echo
+            return 1
+        else
+            echo
+            echo "${COLOR_RED}Test type: ${test_type} failed.${COLOR_RESET}"
+            echo
+            return 1
+        fi
+    fi
+}
+
+# Prints status of all parallel test types. Returns 0 if any of the tests are still running, 1 otherwise
+function print_all_test_types_status() {
+    local still_running="false"
+    local test_type
+    echo "${COLOR_YELLOW}Progress for all tests: ${COLOR_BLUE}${TEST_TYPES}${COLOR_RESET}"
+    echo
+    for test_type in ${TEST_TYPES}
+    do
+        if print_test_type_status "${test_type}" ; then
+            still_running="true"
+        fi
+    done
+    if [[ ${still_running} == "true" ]]; then
+        return 0
+    fi
+    return 1
+}
+
+# Starts all test types in parallel
+# TEST_TYPES - list of all test types (it's not an array, it is a space-separated list)
+# ${@} - additional arguments to pass to test execution
+function start_test_types_in_parallel() {
+    local test_type
+    local database_port
+    start_end::group_start "Starting parallel tests: ${TEST_TYPES}"
+    for test_type in ${TEST_TYPES}
+    do
+        # Grabbing a free random port to use for database forwarding
+        database_port="$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1])')";
+        echo
+        echo "Grabbed database port for forwarding: ${database_port}"
+        log_file="${CACHE_TMP_FILE_DIR}/output-${test_type}.log"
+        TEST_TYPE=${test_type} MYSQL_HOST_PORT="${database_port}" POSTGRES_HOST_PORT="${database_port}" \
+            run_single_test_type "${@}" >"${log_file}" 2>&1 &
+        pid=$!
+        echo "The process started for ${test_type}. Log file ${log_file}"
+        echo
+        test_pids[${test_type}]=${pid}
+    done
+    start_end::group_end
+}
+
+# Monitors progress of all test types running
+# TEST_TYPES - all test types to monitor
+# test_pids - map of test types to PIDs
+function monitor_test_types() {
+    start_end::group_start "Monitoring progress of running parallel tests: ${TEST_TYPES}"
+    while true;
+    do
+        sleep "${TEST_SLEEP_TIME}"
+        if ! print_all_test_types_status; then
+            break
+        fi
+    done
+    start_end::group_end
+}
+
+
+# Monitors progress of single test type running
+# $1 - test type to monitor
+# test_pids - map of test types to PIDs
+function monitor_sequential_test_type() {
+    local test_type=$1
+    start_end::group_start "Monitoring progress of running sequential test type: $1"
+    while true;
+    do
+        sleep "${TEST_SLEEP_TIME}"
+        if ! print_test_type_status "${test_type}"; then
+            break
+        fi
+    done
+    start_end::group_end
+}
+
+# Checks status for all test types running. Returns a non-zero code if any of the tests failed (with the
+# exception of Quarantined tests).
+#
+# TEST_TYPES - all test types to run
+#    test_pids - map of test types to PIDs
+#
+# output:
+#   successful_tests - array of test types that succeeded
+#   failed_tests - array of test types that failed
+function check_test_types_status() {
+    start_end::group_start "Checking status for parallel tests: ${TEST_TYPES}"
+    local test_type_exit_code="0"
+    for test_type in "${!test_pids[@]}"; do
+        local pid="${test_pids[${test_type}]}"
+        wait "${pid}"
+        local temp_exit_code=$?
+        if [[ ${temp_exit_code} == "0" ]]; then
+            successful_tests+=("${test_type}")
+        else
+            failed_tests+=("${test_type}")
+            if [[ ${test_type} != "Quarantined" ]]; then
+                test_type_exit_code=${temp_exit_code}
+            fi
+        fi
+    done
+    start_end::group_end
+    return "${test_type_exit_code}"
+}
+
+# Outputs logs for failed test type
+# $1 test type
+function output_log_for_failed_test_type(){
+    local failed_test_type=$1
+    local log_file="${CACHE_TMP_FILE_DIR}/output-${failed_test_type}.log"
+    start_end::group_start "${COLOR_RED}Output for failed ${failed_test_type} from log: ${log_file}${COLOR_RESET}"
+    echo "${COLOR_RED}##### Test type ${failed_test_type} failed ##### ${COLOR_RESET}"
+    echo
+    cat "${log_file}"
+    echo
+    echo "${COLOR_RED}##### Test type ${failed_test_type} failed ##### ${COLOR_RESET}"
+    echo
+    start_end::group_end
+}
+
+function print_test_summary() {
+    if [[ ${#successful_tests[@]} != 0 ]]; then
+        echo
+        echo "${COLOR_GREEN}Successful test types: ${successful_tests[*]} ${COLOR_RESET}"
+        echo

Review comment:
       Yeah. That's a good reason to keep it - and on CI those are folded, so we can always show it, no problem.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#discussion_r591415464



##########
File path: tests/www/test_views.py
##########
@@ -1154,6 +1154,11 @@ def test_page_instance_name_xss_prevention(self):
 
 
 class TestConfigurationView(TestBase):
+    def setUp(self):
+        super().setUp()
+        with mock.patch.dict(os.environ, {"AIRFLOW__CORE__UNIT_TEST_MODE": "False"}):
+            initialize_config()

Review comment:
       Sure. This test reads the content of the airflow configuration and returns it to the caller. The test failed (when I ran it in parallel) because the config was missing. The web API returns the "production" config - no matter whether UNIT_TEST_MODE is set to True or not. I have not yet had time to figure out why "airflow.cfg" is missing when I run the test in parallel mode (it is there when we run tests sequentially).
   
   I still have to figure it out (and planned to). My guess is that some factor influences the sequence of executing the tests, and some other test simply deletes the configuration as part of its execution, so the config never gets restored when `test_configuration_expose_config` is executed. Unfortunately the presence of the config is an 'implicit dependency' in many of our tests. The problem is that the config gets automatically created by the airflow configuration code when it is missing. If any test depends on its presence, it might go unnoticed that the config is actually a side effect of either other tests or simply running the airflow CLI.
   
   So I believe my fix is the `proper` fix (but there are probably other tests with a similar side-effect dependency). Since this test depends on `airflow.cfg` being present, this should not be a side effect; it should be explicitly initialised in the `setUp` of the test. Which is precisely what I do - I make sure that "unit_test_mode" is set to false, so the "production" and not the "unittest" config is initialized, and then I initialise it.
   
   I think this is how the test should be written in the first place.
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-787506971


   Hey @ashb - we need bigger machines as I suspected :) . 
   
   The good news is that it will be much cheaper in the long run as we will need them for far less time.
   
   The tests are failing, but mainly because of memory problems and timeouts (so I guess we are simply using too much RAM); if we up the machines to 64 GB I think this should go rather smoothly. The good news is that even with not enough memory (and with failures/timeouts) the tests took ~26 m for `sqlite` - rather than > 1 h - so when we have enough memory we can achieve the 15 minutes I was hoping for. Those 64 GB machines are only a bit more expensive than the 32 GB ones, so we will save a lot of credits when it works.
   
   We can even optimize it a bit more and have two self-hosted runner types:
   
   1) Big 64 GB ones for the tests
   2) Smaller 32 GB ones  for everything else
   
   It should be rather easy to configure in the `CI.yml`, but I am not sure if the auto-scaling solution we have will handle two types?
   
   Here is a job with partial successes/failures, and it is actually a good one to show how those tests will look. You can see that the output is nicely grouped and you can see very clearly the monitoring and progress (it will be much nicer when we have more memory, because each test will progress much faster). Also, I print a summary of the failed tests at the end - only "failure" outputs are fully printed to the logs at the end, in "Red" groups - this will make it far easier to analyze problems (the same kind of output improvement is in the sequential version of the tests run on GitHub runners).
   
   In this case three test types succeeded (`Heisentests`, `Core`, `Providers`) and the remaining 5 had some failures (most of them, from what I see, due to timeouts, which is perfectly understandable if we ran out of memory and started to swap out to remote SSD in the cloud): https://github.com/apache/airflow/pull/14531/checks?check_run_id=1999175947.
   
   Rationale for bigger machines:
   
   From what I see we have machines with 32 GB, and since half of it will easily be eaten by `tmpfs` when we start writing logs and the like, we only have ~16 GB, which is not enough. During my tests (https://twitter.com/higrys/status/1366037359461101569/photo/1) all the tests running in parallel took ~35 GB of memory on my 64 GB machine. I had just a local SSD, not `tmpfs`, for those tests, but I do not think we need 30 GB of tmpfs for all logs, docker, tmp etc. (and we can fine-tune that if we do).
   
   Also it is more important than before to clean-up the `tmpfs` volumes before each run and make them "pristine" for every run - because we will be using nearly all of it. I think that will also help with cases like #14505 where some left-overs from previous runs are causing the jobs to fail.
   
   This is the machine state before the tests are run:
   
   ```
                   total        used        free      shared  buff/cache   available
     Mem:           30Gi       696Mi        24Gi       3.4Gi       5.5Gi        26Gi
     Swap:            0B          0B          0B
   
     Filesystem      Size  Used Avail Use% Mounted on
     /dev/root       7.7G  2.6G  5.2G  34% /
     devtmpfs         16G     0   16G   0% /dev
     tmpfs            16G     0   16G   0% /dev/shm
     tmpfs           3.1G  804K  3.1G   1% /run
     tmpfs           5.0M     0  5.0M   0% /run/lock
     tmpfs            16G     0   16G   0% /sys/fs/cgroup
     tmpfs           3.1G  168K  3.1G   1% /tmp
     tmpfs            21G  2.9G   18G  15% /var/lib/docker
     tmpfs            16G  534M   15G   4% /home/runner/actions-runner/_work
     /dev/loop0       98M   98M     0 100% /snap/core/10185
     /dev/loop1       56M   56M     0 100% /snap/core18/1885
     /dev/loop2       71M   71M     0 100% /snap/lxd/16922
     /dev/loop3       29M   29M     0 100% /snap/amazon-ssm-agent/2012
   
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] github-actions[bot] commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-792668550


   The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest master at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-803316512


   > An additional option: check the login still exists in the clean script (this is set as a ExecPost run in the action.runner.service file from the userdata
   
   Yeah. Running aws from the runner does not work - we have no permissions to read the password from Secrets Manager there (and that's good). I think using ExecPre is even better than ExecPost actually. It already runs privileged (with !) and in case it fails (I added -e), it will prevent the runner from running. So we will be absolutely sure that when the runner runs, it actually has the credentials.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#discussion_r586582496



##########
File path: scripts/ci/testing/ci_run_airflow_testing.sh
##########
@@ -170,5 +161,282 @@ do
     echo "**********************************************************************************************"
 
     run_airflow_testing_in_docker "${@}"
+}
+
+
+# Cleans up Docker before test execution. System prune should clean all the temporary/unnamed images
+# And left-over volumes
+function cleanup_docker() {
+    echo
+    echo "System-prune docker"
+    echo
+    docker system prune --force --volumes
+    echo
+    echo "Check available space"
+    echo
+    df --human
+    echo
+    echo "Check available memory"
+    echo
+    free --human
+}
+
+# Prints status of a parallel test type. Returns 0 if the test type is still running, 1 otherwise
+#
+# $1 - Test type
+#
+# The test_pids variable should contain mapping of test types to pids
+# NOTE: It might be a surprise that local variables (like test_pids) are visible to called functions,
+#       but this is how local variables in bash work (by design).
+function print_test_type_status() {
+    local test_type=$1
+    local pid="${test_pids[${test_type}]}"
+    local log_file="${CACHE_TMP_FILE_DIR}/output-${test_type}.log"
+    if ps -p "${pid}" >/dev/null 2>/dev/null; then
+        if [[ ${TEST_SHOW_ALL_OUTPUT_LINES_AS_THEY_APPEAR} != "true" ]]; then
+            echo "${COLOR_BLUE}Test type ${test_type} in progress (last ${TEST_SHOW_OUTPUT_LINES} lines of the output shown).${COLOR_RESET}"
+            echo
+            tail -${TEST_SHOW_OUTPUT_LINES} "${log_file}"
+            echo
+            echo
+        fi
+        return 0
+    else
+        if wait "${pid}"; then
+            echo
+            echo "${COLOR_GREEN}Test type: ${test_type} succeeded.${COLOR_RESET}"
+            echo
+            return 1
+        else
+            echo
+            echo "${COLOR_RED}Test type: ${test_type} failed.${COLOR_RESET}"
+            echo
+            return 1
+        fi
+    fi
+}
+
+# Prints status of all parallel test types. Returns 0 if any of the tests are still running, 1 otherwise
+function print_all_test_types_status() {
+    local still_running="false"
+    local test_type
+    echo "${COLOR_YELLOW}Progress for all tests: ${COLOR_BLUE}${TEST_TYPES}${COLOR_RESET}"
+    echo
+    for test_type in ${TEST_TYPES}
+    do
+        if print_test_type_status "${test_type}" ; then
+            still_running="true"
+        fi
+    done
+    if [[ ${still_running} == "true" ]]; then
+        return 0
+    fi
+    return 1
+}
+
+# Starts all test types in parallel
+# TEST_TYPES - list of all test types (it's not an array, it is a space-separated list)
+# ${@} - additional arguments to pass to test execution
+function start_test_types_in_parallel() {
+    local test_type
+    local database_port
+    start_end::group_start "Starting parallel tests: ${TEST_TYPES}"
+    for test_type in ${TEST_TYPES}
+    do
+        # Grabbing a free random port to use for database forwarding
+        database_port="$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1])')";
+        echo
+        echo "Grabbed database port for forwarding: ${database_port}"
+        log_file="${CACHE_TMP_FILE_DIR}/output-${test_type}.log"
+        TEST_TYPE=${test_type} MYSQL_HOST_PORT="${database_port}" POSTGRES_HOST_PORT="${database_port}" \
+            run_single_test_type "${@}" >"${log_file}" 2>&1 &

Review comment:
       Yep. First time for me as well. Interesting to see that the current PID does not work.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-789303822


   Have you looked at pytest-xdist?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
ashb commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-789693770


   > > Have you double checked the build time with this change on one of the proposed AWS instances? I'm nervous of blowing throw our AWS credits in 1 month if this change _doesn't_ help.
   > 
   > Yes. I already explained it when I created the PR (just scroll the PR few pages up).: [#14531 (comment)](https://github.com/apache/airflow/pull/14531#issuecomment-787506971) but I can explain it again and again, and again no problem. I am super patient person, though sometimes I am quite surprised that I have to answer questions that already have been answered (and ones that you even commented on).
   
   Not once in that comment you linked to did you say that you tested it *on AWS*.
   
   In both that linked comment and the original PR description you say "On one personal PC"/"on my 64 GB machine". That is why I asked you to test that this helps as much on AWS as it does on your local machine.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on a change in pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on a change in pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#discussion_r591415464



##########
File path: tests/www/test_views.py
##########
@@ -1154,6 +1154,11 @@ def test_page_instance_name_xss_prevention(self):
 
 
 class TestConfigurationView(TestBase):
+    def setUp(self):
+        super().setUp()
+        with mock.patch.dict(os.environ, {"AIRFLOW__CORE__UNIT_TEST_MODE": "False"}):
+            initialize_config()

Review comment:
       Sure. This test reads the content of the airflow configuration and returns it to the caller. The test failed (when I ran it in parallel) because the config was missing. The web API returns the "production" config - no matter whether UNIT_TEST_MODE is set to True or not. I have not yet had time to figure out why "airflow.cfg" is missing when I run the test in parallel mode (it is there when we run tests sequentially).
   
   I still have to figure it out (and planned to). My guess is that some factor influences the sequence of executing the tests, and some other test simply deletes the configuration as part of its execution, so the config never gets restored when `test_configuration_expose_config` is executed. Unfortunately the presence of the config is an 'implicit dependency' in many of our tests. The problem is that the config gets automatically created by the airflow configuration code when it is missing. If any test depends on its presence, it might go unnoticed that the config is actually a side effect of either other tests or simply running the airflow CLI.
   
   So I believe my fix is the `proper` fix (but there are probably other tests with a similar side-effect dependency). Since this test depends on `airflow.cfg` being present, this should not be a side effect; it should be explicitly initialised in the `setUp` of the test. Which is precisely what I do - I make sure that "unit_test_mode" is set to false, so the "production" and not the "unittest" config is initialized, and then I initialise it.
   
   I think this is how the test should be written in the first place.
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-787506971


   Hey @ashb - we need bigger machines as I suspected :) . 
   
   The good news is that it will be much cheaper in the long run as we will need them for far less time.
   
   The tests are failing, but mainly because of memory problems and timeouts (so I guess we are simply using too much RAM); if we up the machines to 64 GB I think this should go rather smoothly. The good news is that even with not enough memory (and with failures/timeouts) the tests took ~26 m (!) for `sqlite` - rather than > 1 h - so when we have enough memory we can achieve the 15 minutes I was hoping for. Those 64 GB machines are only a bit more expensive than the 32 GB ones, so we will save a lot of credits when it works.
   
   We can even optimize it a bit more and have two self-hosted runner types:
   
   1) Big 64 GB ones for the tests
   2) Smaller 32 GB ones  for everything else
   
   It should be rather easy to configure in the `CI.yml`, but I am not sure if the auto-scaling solution we have will handle two types?
   
   Here is a job with partial successes/failures, and it is actually a good one to show how those tests will look. You can see that the output is nicely grouped and you can see very clearly the monitoring and progress (it will be much nicer when we have more memory, because each test will progress much faster). Also, I print a summary of the failed tests at the end - only "failure" outputs are fully printed to the logs at the end, in "Red" groups - this will make it far easier to analyze problems (the same kind of output improvement is in the sequential version of the tests run on GitHub runners).
   
   In this case three test types succeeded (`Heisentests`, `Core`, `Providers`) and the remaining 5 had some failures (most of them, from what I see, due to timeouts, which is perfectly understandable if we ran out of memory and started to swap out to remote SSD in the cloud): https://github.com/apache/airflow/pull/14531/checks?check_run_id=1999175947.
   
   ## Rationale for bigger machines
   
   From what I see we have machines with 32 GB, and since half of it will easily be eaten by `tmpfs` when we start writing logs and the like, we only have ~16 GB, which is not enough. During my tests (https://twitter.com/higrys/status/1366037359461101569/photo/1) all the tests running in parallel took ~35 GB of memory on my 64 GB machine. I had just a local SSD, not `tmpfs`, for those tests, but I do not think we need 30 GB of tmpfs for all logs, docker, tmp etc. (and we can fine-tune that if we do).
   
   Also it is more important than before to clean-up the `tmpfs` volumes before each run and make them "pristine" for every run - because we will be using nearly all of it. I think that will also help with cases like #14505 where some left-overs from previous runs are causing the jobs to fail.
   
   This is the machine state before the tests are run:
   
   ```
                   total        used        free      shared  buff/cache   available
     Mem:           30Gi       696Mi        24Gi       3.4Gi       5.5Gi        26Gi
     Swap:            0B          0B          0B
   
     Filesystem      Size  Used Avail Use% Mounted on
     /dev/root       7.7G  2.6G  5.2G  34% /
     devtmpfs         16G     0   16G   0% /dev
     tmpfs            16G     0   16G   0% /dev/shm
     tmpfs           3.1G  804K  3.1G   1% /run
     tmpfs           5.0M     0  5.0M   0% /run/lock
     tmpfs            16G     0   16G   0% /sys/fs/cgroup
     tmpfs           3.1G  168K  3.1G   1% /tmp
     tmpfs            21G  2.9G   18G  15% /var/lib/docker
     tmpfs            16G  534M   15G   4% /home/runner/actions-runner/_work
     /dev/loop0       98M   98M     0 100% /snap/core/10185
     /dev/loop1       56M   56M     0 100% /snap/core18/1885
     /dev/loop2       71M   71M     0 100% /snap/lxd/16922
     /dev/loop3       29M   29M     0 100% /snap/amazon-ssm-agent/2012
   
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-803372107


   Finally ALL GREEN ! Merging.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-803296288


   Hey @ashb - I have experienced quite a few problems with "pull limits with Docker" again. I have no idea where the credentials get deleted. I have a few candidates but I cannot pinpoint any of them:
   
   * system prune of docker sometimes deleting the credentials (maybe when they expired)?
   * unit tests of docker operator
   * cancelling the test while we are adding experimental flag
   
   My most probable hypothesis is now:
   
   * temporary failure to login when the runner gets initialized
   
   For me the most likely cause is the last one: a temporary failure to log in to the docker registry when the runner gets initialized. It is rather plausible - no error is generated when it happens, and we do not have `set -e` in the init scripts, so this sounds very much like it.
   
   For now I did a very simple thing - I simply repeat the login in the CI scripts. It seems it does not even need to be done in the runner - we can do it only when we have `RUNS_ON = self-hosted` (the awscli is still installed and we have the same permissions as the runner code to retrieve the password from the secret). I am testing this. Eventually we should be able to remove this line from the runner init, I believe.
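
   A rough sketch of what I mean - the secret name, username variable and registry host are placeholders, not the real CI values:

   ```bash
   if [[ "${RUNS_ON}" == "self-hosted" ]]; then
       # awscli is installed on the self-hosted runners and has the same permissions
       # the runner init code uses to fetch the registry password from the secret.
       registry_password=$(aws secretsmanager get-secret-value \
           --secret-id "ci/docker-registry-password" \
           --query SecretString --output text)

       # Repeat the login here in case the login performed at runner init silently failed.
       echo "${registry_password}" | \
           docker login --username "${DOCKER_USERNAME}" --password-stdin "docker.example.com"
   fi
   ```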
   
   
   





[GitHub] [airflow] github-actions[bot] commented on pull request #14531: Running tests in parallel

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-798931712


   [The Workflow run](https://github.com/apache/airflow/actions/runs/651642721) is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.





[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-788986596


   Hey @ashb - would you update the machines to the bigger ones, or can I do it myself? How do we proceed?





[GitHub] [airflow] potiuk merged pull request #14531: Running tests in parallel

Posted by GitBox <gi...@apache.org>.
potiuk merged pull request #14531:
URL: https://github.com/apache/airflow/pull/14531


   





[GitHub] [airflow] potiuk commented on pull request #14531: Running tests in parallel

Posted by GitBox <gi...@apache.org>.
potiuk commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-801448923


   FYI, I vastly simplified the parallelisation bash code here. I am leaning much more on `GNU parallel` functionality rather than manually managing PIDs. It is much simpler and more straightforward now.
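
   To illustrate, a minimal sketch of the approach - not the actual Breeze code; `run_single_test_type` and the list of test types are stand-ins:

   ```bash
   #!/usr/bin/env bash
   set -euo pipefail

   # Stand-in for the real function that runs one test type and captures its output.
   run_single_test_type() {
       echo "Running tests for type: $1"
   }
   export -f run_single_test_type

   TEST_TYPES=(Core Providers API CLI WWW Integration Other Heisentests)

   # GNU parallel starts the jobs, limits concurrency and collects exit codes,
   # instead of backgrounding processes and tracking their PIDs by hand.
   parallel --jobs "${MAX_PARALLEL_TEST_JOBS:-$(nproc)}" \
       run_single_test_type ::: "${TEST_TYPES[@]}"
   ```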





[GitHub] [airflow] potiuk edited a comment on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-795322986


   > Before I merge this one I added one more change. It should allow us to dynamically choose (using gnu parallel --semaphore) how many test types we run in parallel, depending on the number of CPUs (overridable with `MAX_PARALLEL_TEST_JOBS`). This should get us to ~20 minutes even with smaller machines, I believe. It is also very likely (testing it in https://github.com/potiuk/airflow/actions/runs/637728789) that even with the default 2 CPUs of the public GitHub runners we should be able to achieve a ~2x speedup. That would be awesome.
   
   Just an update on this one:
   
   * Indeed, the parallel semaphore limiting the number of parallel test types works pretty nicely (see the sketch after this list). With the 4-CPU, 32 GB machines we get ~30-minute test runs, which is pretty much what I would expect.
   * The nice thing is that the tests automatically adjust the parallelism level depending on the number of CPUs/cores available, which is also great for developers running the tests on their development machines.
   * I almost got it running with ~45-minute test runs on the public GitHub runners. There are some hangs which I still have to look at - likely due to memory usage. Maybe we can limit that further and get it a bit faster.
   * I want a predictable sequence of the test types. Right now the execution length varies a lot, because the parallel semaphore releases the jobs in arbitrary order and we might spend a long time where only the last, longest job is running. I would rather start the longest jobs first, followed by the shorter ones, as this gives better parallelism utilisation and predictability. I will have to look into that.
   * There are still a few tests failing (but there is a different reason for that - I will talk to @kaxil about it).
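
   A minimal sketch of the semaphore idea, reusing the stand-in `run_single_test_type` helper from the earlier sketch (the option values are illustrative):

   ```bash
   # Limit concurrency to MAX_PARALLEL_TEST_JOBS if set, otherwise to the number of CPUs.
   MAX_PARALLEL_TEST_JOBS="${MAX_PARALLEL_TEST_JOBS:-$(nproc)}"

   for test_type in Core Providers API CLI WWW Integration Other Heisentests; do
       # --semaphore queues the job and returns immediately; at most
       # MAX_PARALLEL_TEST_JOBS jobs with this --id run at the same time.
       parallel --semaphore --id airflow_tests --jobs "${MAX_PARALLEL_TEST_JOBS}" --bg \
           run_single_test_type "${test_type}"
   done

   # Block until every job queued under this semaphore id has finished.
   parallel --semaphore --id airflow_tests --wait
   ```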





[GitHub] [airflow] potiuk edited a comment on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-789386354


   > Have you looked at pytest-xdist?
   
   As I already explained several times - yes, we have looked at this several times in the past, by different people. The first time was 2 years ago, and we got scared by the results. The results were always the same - it failed big time. You can try it yourself if you want.
   
   Our tests (a big number of them) rely on a shared database which is reset, reinitialized, filled with data and cleaned up by multiple tests. The shared database is a shared-state resource, and if we try to run the tests in parallel using xdist they overwrite each other's data and fail completely randomly. There is no way to run parallel tests sharing the same database (unless we completely redefine our tests and always mock the database, or somehow isolate all the tests that use the DB).
   
   In particular, quite a lot of tests from the core (including the scheduler ones - I guess you are familiar with those) actually use the DB and are not ready to be run in parallel. I think you can imagine best what starts happening if you run multiple scheduler tests against the same database in parallel. But those are not the only ones - the Airflow test suite encourages people to use the DB during their testing.
   
   There are also a number of tests that rely on side effects from other tests (stored in the same database).  
   
   My solution is to give every test type its own database (and within each "type group" the tests run sequentially), while the groups themselves run in parallel, as sketched below. This way they cannot overwrite each other's data, and since I already run each group separately in the 'sequential' approach, I know that the side effects are at least 'under control' (i.e. they do not show up).
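
   A sketch of the idea - the database names, credentials and test paths here are illustrative, not the real CI values:

   ```bash
   # Run one test type against its own database; the tests inside the group stay sequential,
   # only the groups themselves run in parallel.
   run_test_type_with_own_db() {
       local test_type=$1
       local lower_type
       lower_type=$(echo "${test_type}" | tr '[:upper:]' '[:lower:]')

       # One database per test type, so parallel groups cannot overwrite each other's data.
       psql -h localhost -U postgres -c "CREATE DATABASE airflow_${lower_type};"

       # Point Airflow at the per-type database before running the group's tests.
       AIRFLOW__CORE__SQL_ALCHEMY_CONN="postgresql+psycopg2://postgres:airflow@localhost/airflow_${lower_type}" \
           pytest "tests/${lower_type}"   # illustrative path, not the real test selection
   }
   ```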
   
   Of course, if we lived in an ideal world we would have no side effects and no shared database, in which case pytest-xdist would work for us. But unfortunately our small Airflow world is not perfect, and fixing it to be perfect could easily take weeks or months of work isolating, fixing and mocking out all the tests. It's a great regret of mine, but I believe no one in the community has time for it, so I prefer to implement what gives an immediate effect with very small effort.
   
   But I do encourage you to take on the task and fix all the tests - it would be perfect to have that at some point, and it would likely drive the test execution time down even more.





[GitHub] [airflow] github-actions[bot] commented on pull request #14531: Running tests in parallel for self-hosted runners

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-787496898


   [The Workflow run](https://github.com/apache/airflow/actions/runs/608288234) is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.

