Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/02/28 19:23:21 UTC

[GitHub] [airflow] potiuk edited a comment on pull request #14531: Running tests in parallel for self-hosted runners

potiuk edited a comment on pull request #14531:
URL: https://github.com/apache/airflow/pull/14531#issuecomment-787506971


   Hey @ashb - we need bigger machines, as I suspected :)
   
   The good news is that it will be much cheaper in the long run as we will need them for far less time.
   
   The tests are failing, but mainly because of memory problems and timeouts (so I guess we are simply using too much RAM). If we up the machines to 64 GB, I think this should go rather smoothly. Even with not enough memory (and with failures/timeouts) the tests took ~26 min for `sqlite` rather than > 1 h, so once we have enough memory we can achieve the 15 minutes I was hoping for. Those 64 GB machines are only a bit more expensive than the 32 GB ones, so we will save a lot of credits when it works.
   
   We can even optimize this a bit and have two self-hosted runner types:
   
   1) Big 64 GB ones for the tests
   2) Smaller 32 GB ones for everything else
   
   It should be rather easy to configure in the `CI.yml`, but I am not sure whether the auto-scaling solution we have will handle two types.
   
   The job linked below has partial successes/failures, and it is actually a good one to show what the test output will look like. You can see that the output is nicely grouped, and the monitoring and progress are clearly visible (it will be much nicer when we have more memory, because each test will progress much faster). I also print a summary of the failed tests at the end: only the "failure" outputs are fully printed to the logs, in "Red" groups. This will make it far easier to analyze problems (the same kind of output improvement is in the sequential version of the tests run on GitHub runners).
   
   In this case three test types succeeded (`Heisentests Core Providers`) and the remaining 5 had some failures (most of them, from what I see, due to timeouts, which is perfectly understandable if we ran out of memory and started to swap out to remote SSD in the cloud): https://github.com/apache/airflow/pull/14531/checks?check_run_id=1999175947
   
   Rationale for bigger machines:
   
   From what I see we have machines with 32 GB, and since half of it will easily be eaten by `tmpfs` when we start writing logs and the like, we only have ~16 GB left, which is not enough. During my tests (https://twitter.com/higrys/status/1366037359461101569/photo/1) all the tests running in parallel took ~35 GB of memory on my 64 GB machine. I had just a local SSD, not `tmpfs`, for those tests, but I do not think we need 30 GB of `tmpfs` for all logs, docker, tmp etc. (and we can fine-tune that if we do).
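
   As a rough illustration of the arithmetic above (a sketch, not part of the PR): the 32 GB / ~16 GB and ~35 GB figures come from the comment, while the helper names and the 64 GB scenarios are hypothetical and only show how the `tmpfs` share changes the outcome.

```python
# Back-of-the-envelope memory budget for the parallel test run.
# Figures marked "from the comment" are quoted above; everything else
# (function names, the 64 GB scenarios) is an illustrative assumption.

OBSERVED_PEAK_GB = 35  # from the comment: parallel run on a 64 GB machine, local SSD


def available_for_tests(total_gb: float, tmpfs_gb: float) -> float:
    """RAM left for test processes once tmpfs contents are accounted for."""
    return total_gb - tmpfs_gb


def fits(total_gb: float, tmpfs_gb: float, peak_gb: float = OBSERVED_PEAK_GB) -> bool:
    """True if the remaining RAM covers the observed peak usage."""
    return available_for_tests(total_gb, tmpfs_gb) >= peak_gb


scenarios = [
    (32, 16),  # from the comment: current machines, half eaten by tmpfs
    (64, 30),  # hypothetical: bigger machine with ~30 GB of tmpfs in use
    (64, 16),  # hypothetical: bigger machine with tmpfs tuned down
]

for total_gb, tmpfs_gb in scenarios:
    left = available_for_tests(total_gb, tmpfs_gb)
    verdict = "enough" if fits(total_gb, tmpfs_gb) else "not enough"
    print(f"{total_gb} GB RAM, {tmpfs_gb} GB tmpfs -> {left} GB for tests: {verdict}")
```

   Under these assumptions the 64 GB machine only has comfortable headroom if the `tmpfs` footprint stays well below 30 GB, which is why tuning the `tmpfs` sizes matters.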
   
   Also, it is more important than before to clean up the `tmpfs` volumes before each run and make them "pristine", because we will be using nearly all of it. I think that will also help with cases like #14505, where some leftovers from previous runs are causing the jobs to fail.
   
   ```
                   total        used        free      shared  buff/cache   available
     Mem:           30Gi       696Mi        24Gi       3.4Gi       5.5Gi        26Gi
     Swap:            0B          0B          0B
   
     Filesystem      Size  Used Avail Use% Mounted on
     /dev/root       7.7G  2.6G  5.2G  34% /
     devtmpfs         16G     0   16G   0% /dev
     tmpfs            16G     0   16G   0% /dev/shm
     tmpfs           3.1G  804K  3.1G   1% /run
     tmpfs           5.0M     0  5.0M   0% /run/lock
     tmpfs            16G     0   16G   0% /sys/fs/cgroup
     tmpfs           3.1G  168K  3.1G   1% /tmp
     tmpfs            21G  2.9G   18G  15% /var/lib/docker
     tmpfs            16G  534M   15G   4% /home/runner/actions-runner/_work
     /dev/loop0       98M   98M     0 100% /snap/core/10185
     /dev/loop1       56M   56M     0 100% /snap/core18/1885
     /dev/loop2       71M   71M     0 100% /snap/lxd/16922
     /dev/loop3       29M   29M     0 100% /snap/amazon-ssm-agent/2012
   
   ```
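
   To check how much memory the `tmpfs` mounts actually hold before a run, the `Used` column of a `df -h` listing like the one above can be summed. The following is a minimal sketch under that assumption; the function names are hypothetical, and the sample is a subset of the rows shown above.

```python
# Sum the "Used" column of tmpfs rows in `df -h` style output.
# Assumes the human-readable, 1024-based suffixes that `df -h` prints.

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text: str) -> float:
    """Convert a `df -h` size such as '2.9G', '804K' or '0' to bytes."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def tmpfs_used_bytes(df_output: str) -> float:
    """Total bytes used across all rows whose filesystem is 'tmpfs'."""
    total = 0.0
    for line in df_output.splitlines():
        fields = line.split()
        # Expected columns: Filesystem Size Used Avail Use% Mounted-on
        if len(fields) >= 6 and fields[0] == "tmpfs":
            total += parse_size(fields[2])
    return total


SAMPLE = """\
Filesystem      Size  Used Avail Use% Mounted on
/dev/root       7.7G  2.6G  5.2G  34% /
tmpfs            16G     0   16G   0% /dev/shm
tmpfs           3.1G  804K  3.1G   1% /run
tmpfs            21G  2.9G   18G  15% /var/lib/docker
tmpfs            16G  534M   15G   4% /home/runner/actions-runner/_work
"""

print(f"{tmpfs_used_bytes(SAMPLE) / 1024**3:.2f} GiB used on tmpfs")
```

   For the sample rows this comes to roughly 3.4 GiB, in line with the `shared` column of the `free` output above. A check like this could run at the start of a job to confirm the volumes are close to empty.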


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org