Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/10/08 22:25:33 UTC

[GitHub] [airflow] teastburn opened a new issue #11365: Scheduler out of memory / stuck

teastburn opened a new issue #11365:
URL: https://github.com/apache/airflow/issues/11365


   Thanks in advance for your help and work on Airflow. ❤️ 
   
   **Apache Airflow version**: 1.10.12
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`): N/A (we use Celery)
   
   **Environment**: 
   
   - **Cloud provider or hardware configuration**: AWS ECS
   - **OS** (e.g. from /etc/os-release): Ubuntu 18.04
   - **Kernel** (e.g. `uname -a`): 4.15.0
   - **DB**: Postgres (AWS RDS)
   - **Scheduler settings**:
   max_threads = 10 
   job_heartbeat_sec = 5
   scheduler_heartbeat_sec = 30
   run_duration = 600
   num_runs = -1
   processor_poll_interval = 1
   min_file_process_interval = 30
   dag_dir_list_interval = 300
   scheduler_health_check_threshold = 300
   scheduler_zombie_task_threshold = 300
   max_tis_per_query = 64
   
   - **Install tools**: conda, pip, ???
   - **Others**: 
   
   **What happened**:
   
   After upgrading from 1.8.2 to 1.10.12, we experience ~1-5 scheduler out-of-memory (OOM) events per day. When one occurs, CPU usage bottoms out and the scheduler stops scheduling new work. Restarting the container brings up a new scheduler, which works until the next OOM.
   
   **What you expected to happen**:
   
   The scheduler should use a normal amount of CPU and RAM, should not exceed MAX_THREADS, and should continue scheduling new work.
   
   **How to reproduce it**:
   
   Let the scheduler run for a day. Sorry, I don't have much more data. We run ~250 DAGs across ~25 files, with ~5000 tasks per hour.
   
   **Similar issues**: https://github.com/apache/airflow/issues/7935 -- we also experience this issue, and they seem related.
   
   
   **Anything else we need to know**:
   
   After the upgrade we raised our RAM from ~2GB to ~6GB and we still get this issue. There is no reason our scheduler should need ~6GB of RAM.
   
   ![Screen Shot 2020-10-08 at 2 30 26 PM](https://user-images.githubusercontent.com/134710/95515737-f2d68980-0972-11eb-907d-9a5e82492bba.png)
   Above is an example of the scheduler's CPU and RAM usage during an OOM event. Recovery was done manually.
   
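   OOM logs for the Python process (the trace is not always the same as this one; the log lines are exported newest first, so the traceback reads bottom to top):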
   ```
   2020-10-06T05:34:18.092Z,OSError: [Errno 12] Cannot allocate memory
   2020-10-06T05:34:18.092Z,"    self.pid = os.fork()"
   2020-10-06T05:34:18.092Z,"  File ""/conda/env/lib/python2.7/multiprocessing/forking.py"", line 121, in __init__"
   2020-10-06T05:34:18.092Z,"    self._popen = Popen(self)"
   2020-10-06T05:34:18.092Z,"  File ""/conda/env/lib/python2.7/multiprocessing/process.py"", line 130, in start"
   2020-10-06T05:34:18.092Z,"    self._process.start()"
   2020-10-06T05:34:18.092Z,"  File ""/airflow/airflow/jobs/scheduler_job.py"", line 203, in start"
   2020-10-06T05:34:18.092Z,"    processor.start()"
   2020-10-06T05:34:18.092Z,"  File ""/airflow/airflow/utils/dag_processing.py"", line 1250, in start_new_processes"
   2020-10-06T05:34:18.092Z,"    self.start_new_processes()"
   2020-10-06T05:34:18.092Z,"  File ""/airflow/airflow/utils/dag_processing.py"", line 886, in start"
   2020-10-06T05:34:18.091Z,"    processor_manager.start()"
   2020-10-06T05:34:18.091Z,"  File ""/airflow/airflow/utils/dag_processing.py"", line 634, in _run_processor_manager"
   2020-10-06T05:34:18.091Z,"    self._target(*self._args, **self._kwargs)"
   2020-10-06T05:34:18.091Z,"  File ""/conda/env/lib/python2.7/multiprocessing/process.py"", line 114, in run"
   2020-10-06T05:34:18.091Z,"    self.run()"
   2020-10-06T05:34:18.091Z,"  File ""/conda/env/lib/python2.7/multiprocessing/process.py"", line 267, in _bootstrap"
   2020-10-06T05:34:18.091Z,Traceback (most recent call last):
   2020-10-06T05:34:18.091Z,Process Process-1:
   ```
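
Because the export above is newest first, the traceback is easier to read once the lines are put back in their usual order. The following is a minimal sketch (Python 3, not part of the original report), assuming each exported line is a `timestamp,message` CSV record as shown above. `restore_traceback` is a hypothetical helper name, and the sample is a short excerpt of the log.

```python
import csv


def restore_traceback(exported_lines):
    """Return the message column in oldest-first order.

    Assumes every non-empty line is a CSV record of the form
    `timestamp,message`, listed newest first (as in the export above).
    """
    cleaned = (line.strip() for line in exported_lines if line.strip())
    rows = list(csv.reader(cleaned))
    # Join any extra columns back together in case an unquoted message
    # contained a comma, then reverse to get chronological order.
    return [",".join(row[1:]) for row in reversed(rows)]


exported = [
    '2020-10-06T05:34:18.092Z,OSError: [Errno 12] Cannot allocate memory',
    '2020-10-06T05:34:18.092Z,"    self.pid = os.fork()"',
    '2020-10-06T05:34:18.092Z,"  File ""/conda/env/lib/python2.7/multiprocessing/forking.py"", line 121, in __init__"',
    '2020-10-06T05:34:18.091Z,Traceback (most recent call last):',
    '2020-10-06T05:34:18.091Z,Process Process-1:',
]

for message in restore_traceback(exported):
    print(message)
```

Reversing the list, rather than sorting by timestamp, keeps the relative order of lines that share the same millisecond.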
   
   <details><summary>OOM logs from host OS (there were 18 separate oom-killer events, this is one)</summary>
   
   ```
   [201596.826370] Killed process 73325 (/conda/env/) total-vm:2848568kB, anon-rss:166544kB, file-rss:7928kB, shmem-rss:4kB
   [201596.826340] Memory cgroup out of memory: Kill process 86085 (/conda/env/) score 33 or sacrifice child
   [201596.826316] [73474]     0 73474   660268    38726  1187840        0             0 airflow schedul
   [201596.826314] [73472]     0 73472   660268    39102  1196032        0             0 airflow schedul
   [201596.826311] [73445]     0 73445   660268    39135  1196032        0             0 airflow schedul
   [201596.826309] [73436]     0 73436   660268    39184  1196032        0             0 airflow schedul
   [201596.826308] [73435]     0 73435   660268    39135  1196032        0             0 airflow schedul
   [201596.826307] [73434]     0 73434   660268    39134  1196032        0             0 airflow schedul
   [201596.826305] [73410]     0 73410   660268    39752  1204224        0             0 airflow schedul
   [201596.826304] [73405]     0 73405   712142    42404  1224704        0             0 /conda/env/
   [201596.826302] [73404]     0 73404   712142    42404  1224704        0             0 /conda/env/
   [201596.826300] [73403]     0 73403   712142    42404  1224704        0             0 /conda/env/
   [201596.826299] [73402]     0 73402   712142    42404  1224704        0             0 /conda/env/
   [201596.826297] [73401]     0 73401   712142    42404  1224704        0             0 /conda/env/
   [201596.826295] [73400]     0 73400   712142    42404  1224704        0             0 /conda/env/
   [201596.826293] [73399]     0 73399   712142    42404  1224704        0             0 /conda/env/
   [201596.826291] [73398]     0 73398   712142    42404  1224704        0             0 /conda/env/
   [201596.826290] [73397]     0 73397   712142    42404  1224704        0             0 /conda/env/
   [201596.826287] [73396]     0 73396   712142    42404  1224704        0             0 /conda/env/
   [201596.826286] [73395]     0 73395   712142    42404  1224704        0             0 /conda/env/
   [201596.826284] [73394]     0 73394   712142    42404  1224704        0             0 /conda/env/
   [201596.826282] [73393]     0 73393   712142    42404  1224704        0             0 /conda/env/
   [201596.826280] [73392]     0 73392   712142    42404  1224704        0             0 /conda/env/
   [201596.826279] [73391]     0 73391   712142    42410  1224704        0             0 /conda/env/
   [201596.826277] [73390]     0 73390   712142    42404  1224704        0             0 /conda/env/
   [201596.826275] [73389]     0 73389   712142    42404  1224704        0             0 /conda/env/
   [201596.826274] [73388]     0 73388   712142    42404  1224704        0             0 /conda/env/
   [201596.826272] [73387]     0 73387   712142    42404  1224704        0             0 /conda/env/
   [201596.826270] [73386]     0 73386   712142    42404  1224704        0             0 /conda/env/
   [201596.826269] [73385]     0 73385   712142    42404  1224704        0             0 /conda/env/
   [201596.826266] [73384]     0 73384   712142    42404  1224704        0             0 /conda/env/
   [201596.826264] [73383]     0 73383   712142    42404  1224704        0             0 /conda/env/
   [201596.826263] [73382]     0 73382   712142    42404  1224704        0             0 /conda/env/
   [201596.826261] [73381]     0 73381   712142    42404  1224704        0             0 /conda/env/
   [201596.826259] [73380]     0 73380   712142    42404  1224704        0             0 /conda/env/
   [201596.826258] [73379]     0 73379   712142    42404  1224704        0             0 /conda/env/
   [201596.826256] [73378]     0 73378   712142    42404  1224704        0             0 /conda/env/
   [201596.826254] [73377]     0 73377   712142    42404  1224704        0             0 /conda/env/
   [201596.826252] [73376]     0 73376   712142    42404  1224704        0             0 /conda/env/
   [201596.826250] [73375]     0 73375   712142    42404  1224704        0             0 /conda/env/
   [201596.826248] [73374]     0 73374   712142    42404  1224704        0             0 /conda/env/
   [201596.826247] [73373]     0 73373   712142    42404  1224704        0             0 /conda/env/
   [201596.826245] [73372]     0 73372   712142    42592  1228800        0             0 /conda/env/
   [201596.826244] [73371]     0 73371   712142    43187  1236992        0             0 /conda/env/
   [201596.826242] [73370]     0 73370   712142    43171  1236992        0             0 /conda/env/
   [201596.826240] [73369]     0 73369   712142    43187  1236992        0             0 /conda/env/
   [201596.826239] [73368]     0 73368   712142    43013  1236992        0             0 /conda/env/
   [201596.826237] [73367]     0 73367   712142    43013  1236992        0             0 /conda/env/
   [201596.826236] [73366]     0 73366   712142    43029  1236992        0             0 /conda/env/
   [201596.826235] [73365]     0 73365   712142    43029  1236992        0             0 /conda/env/
   [201596.826233] [73364]     0 73364   712142    43579  1236992        0             0 /conda/env/
   [201596.826232] [73363]     0 73363   712142    43588  1236992        0             0 /conda/env/
   [201596.826230] [73362]     0 73362   712142    43579  1236992        0             0 /conda/env/
   [201596.826229] [73361]     0 73361   712142    43588  1236992        0             0 /conda/env/
   [201596.826227] [73360]     0 73360   712142    43588  1236992        0             0 /conda/env/
   [201596.826225] [73359]     0 73359   712142    43619  1236992        0             0 /conda/env/
   [201596.826224] [73358]     0 73358   712142    43619  1236992        0             0 /conda/env/
   [201596.826222] [73357]     0 73357   712142    43619  1236992        0             0 /conda/env/
   [201596.826221] [73356]     0 73356   712142    43619  1236992        0             0 /conda/env/
   [201596.826219] [73355]     0 73355   712142    43619  1236992        0             0 /conda/env/
   [201596.826217] [73354]     0 73354   712142    43619  1236992        0             0 /conda/env/
   [201596.826215] [73353]     0 73353   712142    43619  1236992        0             0 /conda/env/
   [201596.826214] [73352]     0 73352   712142    43614  1236992        0             0 /conda/env/
   [201596.826212] [73351]     0 73351   712142    43618  1236992        0             0 /conda/env/
   [201596.826210] [73350]     0 73350   712142    43612  1236992        0             0 /conda/env/
   [201596.826208] [73349]     0 73349   712142    43619  1236992        0             0 /conda/env/
   [201596.826207] [73348]     0 73348   712142    43561  1236992        0             0 /conda/env/
   [201596.826205] [73347]     0 73347   712142    43551  1236992        0             0 /conda/env/
   [201596.826203] [73346]     0 73346   712142    43619  1236992        0             0 /conda/env/
   [201596.826201] [73345]     0 73345   712142    43619  1236992        0             0 /conda/env/
   [201596.826199] [73344]     0 73344   712142    43619  1236992        0             0 /conda/env/
   [201596.826198] [73343]     0 73343   712142    43619  1236992        0             0 /conda/env/
   [201596.826196] [73342]     0 73342   712142    43619  1236992        0             0 /conda/env/
   [201596.826195] [73341]     0 73341   712142    43619  1236992        0             0 /conda/env/
   [201596.826193] [73340]     0 73340   712142    43619  1236992        0             0 /conda/env/
   [201596.826192] [73339]     0 73339   712142    43619  1236992        0             0 /conda/env/
   [201596.826190] [73338]     0 73338   712142    43619  1236992        0             0 /conda/env/
   [201596.826189] [73337]     0 73337   712142    43619  1236992        0             0 /conda/env/
   [201596.826187] [73336]     0 73336   712142    43619  1236992        0             0 /conda/env/
   [201596.826185] [73335]     0 73335   712142    43619  1236992        0             0 /conda/env/
   [201596.826184] [73334]     0 73334   712142    43619  1236992        0             0 /conda/env/
   [201596.826182] [73333]     0 73333   712142    43619  1236992        0             0 /conda/env/
   [201596.826180] [73332]     0 73332   712142    43619  1236992        0             0 /conda/env/
   [201596.826178] [73331]     0 73331   712142    43619  1236992        0             0 /conda/env/
   [201596.826176] [73330]     0 73330   712142    43619  1236992        0             0 /conda/env/
   [201596.826175] [73329]     0 73329   712142    43619  1236992        0             0 /conda/env/
   [201596.826173] [73328]     0 73328   712142    43583  1236992        0             0 /conda/env/
   [201596.826172] [73327]     0 73327   712142    43574  1236992        0             0 /conda/env/
   [201596.826170] [73326]     0 73326   712142    43614  1236992        0             0 /conda/env/
   [201596.826168] [73325]     0 73325   712142    43619  1236992        0             0 /conda/env/
   [201596.826165] [73321]     0 73321   712142    43609  1236992        0             0 /conda/env/
   [201596.826164] [73320]     0 73320   712142    43614  1236992        0             0 /conda/env/
   [201596.826161] [73317]     0 73317   712142    43533  1236992        0             0 /conda/env/
   [201596.826159] [73316]     0 73316   712142    43535  1236992        0             0 /conda/env/
   [201596.826158] [73315]     0 73315   712142    43617  1236992        0             0 /conda/env/
   [201596.826156] [73314]     0 73314   712142    43617  1236992        0             0 /conda/env/
   [201596.826147] [73254]     0 73254   665259    44877  1245184        0             0 airflow schedul
   [201596.826146] [73253]     0 73253   663081    42595  1224704        0             0 airflow schedul
   [201596.826144] [73247]     0 73247   660815    38868  1187840        0             0 airflow schedul
   [201596.826142] [73241]     0 73241   665744    45272  1245184        0             0 airflow schedul
   [201596.826090] [72514]     0 72514     1136      200    57344        0             0 sleep
   [201596.826083] [71891]     0 71891     1136      190    57344        0             0 sleep
   [201596.825820] [86649]     0 86649   660268    38751  1191936        0             0 airflow schedul
   [201596.825818] [86085]     0 86085   712142    51436  1343488        0             0 /conda/env/
   [201596.825727] [74651]     0 74651     4630      805    81920        0             0 run_airflow.sh
   [201596.825725] [74634]     0 74634     1160      431    53248        0             0 update-dags-che
   [201596.825724] [74543]     0 74543     4630      847    86016        0             0 start
   [201596.825060] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
   [201596.825037] Memory cgroup stats for /ecs/99ee477ab6cd4c6988a5ad1476591f6c/3528c8b404fe2558ba6e47bc84fd515f5a81ea14ed4f6c0a8b7cf668aeeecf45: cache:64KB rss:5785836KB rss_huge:0KB shmem:32KB mapped_file:12KB dirty:0KB writeback:0KB inactive_anon:16KB active_anon:5785296KB inactive_file:28KB active_file:0KB unevictable:0KB
   [201596.825037] kmem: usage 358100kB, limit 9007199254740988kB, failcnt 0
   [201596.825036] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
   [201596.825036] memory: usage 6144000kB, limit 6144000kB, failcnt 19653
   [201596.825031] Task in /ecs/99ee477ab6cd4c6988a5ad1476591f6c/3528c8b404fe2558ba6e47bc84fd515f5a81ea14ed4f6c0a8b7cf668aeeecf45 killed as a result of limit of /ecs/99ee477ab6cd4c6988a5ad1476591f6c/3528c8b404fe2558ba6e47bc84fd515f5a81ea14ed4f6c0a8b7cf668aeeecf45
   [201596.825030] R13: 00000000047d1970 R14: 00007f0fe8ac2090 R15: 00000000047d1970
   [201596.825030] R10: 00000000012de010 R11: 0000000000000004 R12: 000000000000ff00
   [201596.825029] RBP: 00007f0fe8c0d5d0 R08: 0000000000000000 R09: 0000000000000000
   [201596.825029] RDX: 00000000047d1994 RSI: 0000000000000000 RDI: 00000000047d5000
   [201596.825028] RAX: 0000000000000000 RBX: 000000000000ff00 RCX: 000000000000c894
   [201596.825027] RSP: 002b:00007ffd2ff66bb8 EFLAGS: 00010206
   [201596.825026] RIP: 0033:0x7f0fe7ac419d
   [201596.825024]  async_page_fault+0x45/0x50
   [201596.825022]  do_async_page_fault+0x51/0x80
   [201596.825019]  ? async_page_fault+0x2f/0x50
   [201596.825015]  do_page_fault+0x2e/0xe0
   [201596.825014]  __do_page_fault+0x4a5/0x4d0
   [201596.825013]  mm_fault_error+0x90/0x180
   [201596.825007]  pagefault_out_of_memory+0x36/0x7b
   [201596.825006]  ? mem_cgroup_css_online+0x40/0x40
   [201596.825004]  mem_cgroup_oom_synchronize+0x2e8/0x320
   [201596.825002]  mem_cgroup_out_of_memory+0x4b/0x80
   [201596.824998]  out_of_memory+0x2d1/0x4f0
   [201596.824996]  oom_kill_process+0x220/0x440
   [201596.824994]  dump_header+0x71/0x285
   [201596.824988]  dump_stack+0x63/0x8b
   [201596.824980] Call Trace:
   [201596.824980] Hardware name: Amazon EC2 c5d.24xlarge/, BIOS 1.0 10/16/2017
   [201596.824979] CPU: 21 PID: 73436 Comm: airflow schedul Not tainted 4.15.0-1039-aws #41-Ubuntu
   [201596.824974] airflow schedul cpuset=3528c8b404fe2558ba6e47bc84fd515f5a81ea14ed4f6c0a8b7cf668aeeecf45 mems_allowed=0-1
   [201596.824973] airflow schedul invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=0
   [201596.735306] oom_reaper: reaped process 73324 (/conda/env/), now anon-rss:0kB, file-rss:0kB, shmem-rss:8kB
   ```
   
   </details>
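
To see where the memory goes, the per-process table in the oom-killer dump can be tallied by process name. The sketch below (Python 3, not part of the original report) assumes the column layout of the header row above (`pid uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name`) and that `rss` is counted in 4 kB pages. The page size agrees with the log itself: the "Killed process 73325" line reports 166544 + 7928 + 4 = 174476 kB, and the table lists 43619 pages for that pid (43619 x 4 = 174476). `rss_by_name` is a hypothetical helper name.

```python
import re
from collections import defaultdict

PAGE_SIZE_KB = 4  # assumption: 4 kB pages, consistent with the log above

# One row of the oom-killer task table, after the "[timestamp]" prefix.
ROW = re.compile(
    r"\]\s+\[\s*(?P<pid>\d+)\]\s+(?P<uid>\d+)\s+(?P<tgid>\d+)\s+"
    r"(?P<total_vm>\d+)\s+(?P<rss>\d+)\s+(?P<pgtables>\d+)\s+"
    r"(?P<swapents>\d+)\s+(?P<adj>-?\d+)\s+(?P<name>.*\S)\s*$"
)


def rss_by_name(lines):
    """Map process name -> (process count, summed RSS in kB)."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        match = ROW.search(line)
        if match is None:
            continue  # header row, stack trace, or other non-table line
        entry = totals[match.group("name")]
        entry[0] += 1
        entry[1] += int(match.group("rss")) * PAGE_SIZE_KB
    return {name: tuple(entry) for name, entry in totals.items()}


sample = [
    "[201596.826316] [73474]     0 73474   660268    38726  1187840        0             0 airflow schedul",
    "[201596.826304] [73405]     0 73405   712142    42404  1224704        0             0 /conda/env/",
    "[201596.826302] [73404]     0 73404   712142    42404  1224704        0             0 /conda/env/",
    "[201596.825060] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name",
]

for name, (count, rss_kb) in sorted(rss_by_name(sample).items()):
    print(f"{name}: {count} processes, {rss_kb} kB RSS")
```

Note that summing per-process RSS counts pages shared between forked processes (copy-on-write) once per process, so the total can exceed the cgroup's own `rss` figure. The tally is mainly useful for counting how many near-identical processes exist and how large each one is.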
   
   
   
   Similar to issue https://github.com/apache/airflow/issues/7935 we see many weird processes spawned that look like dupes of the main scheduler process (they are not dag processing child processes):
   
   <details><summary>Normal dag processing (one dag being processed)</summary>
   <p>
   
   ```
   $ ps faux
   USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
   root         1  0.0  0.0  18516  3264 ?        Ss   02:23   0:00 /bin/bash ./start scheduler -r 600
   root        35  0.0  0.0  18520  3284 ?        S    02:23   0:00 /bin/bash /app/run_airflow.sh scheduler -r 600
   root     18481  5.7  0.0 2210036 176712 ?      S    04:34   0:10  \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     18541  0.5  0.0 2011744 139364 ?      S    04:34   0:00      \_ airflow scheduler -- DagFileProcessorManager
   root     20173  0.0  0.0 2011744 136904 ?      R    04:37   0:00          \_ airflow scheduler - DagFileProcessor /dags/db_stream_2datalake/stream_2datalake.py
   ```
   </p>
   </details>
   
   <details><summary>Weird dag processing (excess of duplicate threads taking up RAM?). This can be seen by just running `ps faux` a bunch of times during normal/non OOM times.</summary>
   <p>
   
   ```
   $ ps faux
   USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
   root         1  0.0  0.0  18516  3264 ?        Ss   02:23   0:01 /bin/bash ./start scheduler -r 600
   root        35  0.0  0.0  18520  3284 ?        S    02:23   0:00 /bin/bash /app/run_airflow.sh scheduler -r 600
   root     29041  4.3  0.0 2211608 177860 ?      Sl   04:54   0:25  \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     29101  0.5  0.0 2011728 139212 ?      S    04:54   0:03      \_ airflow scheduler -- DagFileProcessorManager
   root     36058 69.3  0.0 2016512 148300 ?      S    05:04   0:02      |   \_ airflow scheduler - DagFileProcessor /dags/pubsub_hourly/pubsub_hourly.py
   root     36123  0.0  0.0 2211624 142008 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36124  0.0  0.0 2211624 142004 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36125  0.0  0.0 2211624 142008 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36126  0.0  0.0 2211624 142012 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36127  0.0  0.0 2211624 142012 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36128  0.0  0.0 2211624 142012 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36129  0.0  0.0 2211624 142012 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36130  0.0  0.0 2211624 142012 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36131  0.0  0.0 2211624 142012 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36132  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36133  0.0  0.0 2211624 142028 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36134  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36135  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36136  0.0  0.0 2211624 142028 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36137  0.0  0.0 2211624 142040 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36138  0.0  0.0 2211624 142040 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36139  0.0  0.0 2211624 142040 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36140  0.0  0.0 2211624 142028 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36141  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36142  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36143  0.0  0.0 2211624 142028 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36144  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36145  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36146  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36147  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36148  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36149  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36150  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36151  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36152  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36153  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36154  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36155  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36156  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36157  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36158  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36159  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36160  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36161  0.0  0.0 2211624 142016 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36162  0.0  0.0 2211624 137964 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36163  0.0  0.0 2211624 137964 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36164  0.0  0.0 2211624 137964 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36165  0.0  0.0 2211624 137964 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36166  0.0  0.0 2211624 137964 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36167  0.0  0.0 2211624 137964 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36168  0.0  0.0 2211624 137964 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   root     36169  0.0  0.0 2211624 137964 ?      S    05:04   0:00      \_ /conda/env/bin/python /conda/env/bin/airflow scheduler -r 600
   ```
   
   </p>
   </details>
   
   <details><summary>Number of Airflow processes running at a time, every .5 seconds (max_threads is 10)</summary>
   
   <p>
   
   ```
   $ while true; do pgrep -f 'airflow scheduler' | wc -l; sleep .5; done
   39
   4
   4
   4
   39
   39
   39
   39
   39
   5
   5
   5
   5
   5
   5
   5
   3
   3
   3
   38
   3
   3
   2
   2
   2
   2
   2
   37
   2
   2
   2
   2
   2
   2
   2
   7
   2
   8
   3
   8
   2
   4
   3
   3
   3
   3
   2
   2
   2
   2
   2
   2
   2
   2
   4
   3
   3
   3
   9
   3
   3
   3
   13
   3
   3
   3
   17
   2
   2
   2
   2
   2
   2
   2
   24
   2
   2
   4
   ```
   
   </p>
   
   </details>
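
The same sampling can be scripted so that spikes are flagged automatically. This is a minimal sketch (Python 3.7+, not part of the original report). It assumes the expected ceiling is `max_threads` DagFileProcessor processes plus two fixed processes (the scheduler and the DagFileProcessorManager), which matches the "normal" `ps` output above but is an assumption rather than a documented guarantee. `count_matching`, `sample_counts`, and `spikes` are hypothetical helper names.

```python
import subprocess
import time


def count_matching(pattern="airflow scheduler"):
    """Count processes whose full command line matches `pattern` (via pgrep -f)."""
    result = subprocess.run(
        ["pgrep", "-f", pattern], capture_output=True, text=True
    )
    # pgrep prints one pid per line and exits non-zero when nothing matches.
    return len(result.stdout.split())


def sample_counts(samples, interval=0.5):
    """Take `samples` readings, `interval` seconds apart."""
    counts = []
    for _ in range(samples):
        counts.append(count_matching())
        time.sleep(interval)
    return counts


def spikes(counts, max_threads=10, fixed_processes=2):
    """Return (sample index, count) pairs above the assumed ceiling."""
    limit = max_threads + fixed_processes
    return [(i, c) for i, c in enumerate(counts) if c > limit]


# First ten readings from the listing above.
observed = [39, 4, 4, 4, 39, 39, 39, 39, 39, 5]
print(spikes(observed))
```

On a live host, `spikes(sample_counts(120))` would sample for about a minute. With a 0.5 second interval, short-lived bursts can still fall between samples.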
   
   I will try to get a py-spy dump of a few processes after the next OOM event.
   
   Any help would be much appreciated! Our on-call engineers are having sleepless nights.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] boring-cyborg[bot] commented on issue #11365: Scheduler out of memory / stuck

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #11365:
URL: https://github.com/apache/airflow/issues/11365#issuecomment-705855315


   Thanks for opening your first issue here! Be sure to follow the issue template!
   





[GitHub] [airflow] teastburn commented on issue #11365: Scheduler out of memory / stuck

Posted by GitBox <gi...@apache.org>.
teastburn commented on issue #11365:
URL: https://github.com/apache/airflow/issues/11365#issuecomment-814240759


   > @teastburn Are you still on 1.10.12?
   
   @ashb Yep we are. No plans to upgrade until a huge leap to 2.x at some point (likely not soon as it looks like a large undertaking).





[GitHub] [airflow] eladkal commented on issue #11365: Scheduler out of memory / stuck

Posted by GitBox <gi...@apache.org>.
eladkal commented on issue #11365:
URL: https://github.com/apache/airflow/issues/11365#issuecomment-938675496


   This issue was reported against an old version of Airflow.
   The scheduler has been refactored significantly since then.
   If you are still experiencing issues with the latest version, please open a new issue.





[GitHub] [airflow] ashb commented on issue #11365: Scheduler out of memory / stuck

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #11365:
URL: https://github.com/apache/airflow/issues/11365#issuecomment-804780417


   @teastburn Are you still on 1.10.12?





[GitHub] [airflow] teastburn commented on issue #11365: Scheduler out of memory / stuck

Posted by GitBox <gi...@apache.org>.
teastburn commented on issue #11365:
URL: https://github.com/apache/airflow/issues/11365#issuecomment-767166524


   We are no longer experiencing this issue but are not sure why / what changed to resolve this.





[GitHub] [airflow] eladkal closed issue #11365: Scheduler out of memory / stuck

Posted by GitBox <gi...@apache.org>.
eladkal closed issue #11365:
URL: https://github.com/apache/airflow/issues/11365


   

