You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/12/11 22:40:34 UTC

[GitHub] [airflow] rafidka opened a new issue #13026: Scheduler heartbeat adds one second if it is is configured to be 5 or less

rafidka opened a new issue #13026:
URL: https://github.com/apache/airflow/issues/13026


   **Apache Airflow version**: 1.10.12
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`): N/A
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: Amazon `r5.4xlarge` EC2 instance.
   - **OS** (e.g. from /etc/os-release): Amazon Linux 2
   - **Kernel** (e.g. `uname -a`): 5.4.58-37.125.amzn2int.x86_64 (via `uname -r`)
   - **Install tools**: N/A
   - **Others**: N/A
   
   **What happened**:
   
   I am experiencing some strange issue with scheduler heartbeat. If I [configure it](https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#scheduler-heartbeat-sec) to 5 seconds, somehow the heartbeat metric is received every 6 seconds. I tried different values and what I noticed is that with values of 5 or less, the heartbeat is received one second later than expected. But with 6 or more, the heartbeat is received at the expected time (though for 2 and 6 I experienced an even stranger fluctuating behaviour). Below is a table of the values I tried and the frequencies I received:
   
   scheduler_heartbeat_sec value|frequency of scheduler_heartbeat metric
   -|-
   0|2
   1|2
   2|Fluctuates between 2 and 4
   3|4
   5|Fluctuates between 6 and 8
   6|6
   10|10
   30|30
   
   **What you expected to happen**:
   
   I expect the `airflow.scheduler_heartbeat` metric to be received at the same frequency specified by the `scheduler_heartbeat_sec` configuration.
   
   **How to reproduce it**:
   
   - Install Airflow locally via pip
   - Enable StatsD metrics via the airflow.cfg configurations:
   ```
   # Statsd (https://github.com/etsy/statsd) integration settings
   statsd_on = True
   statsd_host = localhost
   statsd_port = 8125
   statsd_prefix = airflow
   ```
   - Run Airflow webserver and scheduler and enable some DAGs.
   - Execute the following Python code and watch the timestamp of the received metrics:
   ```python
   import socket
   from datetime import datetime
   
   UDP_IP = "127.0.0.1"
   UDP_PORT = 8125
   
   sock = socket.socket(socket.AF_INET, # Internet
                        socket.SOCK_DGRAM) # UDP
   sock.bind((UDP_IP, UDP_PORT))
   
   prev_iter_time = None
   while True:
       data, addr = sock.recvfrom(1024) # buffer size is 1024 bytes
       message = str(data)
       if 'airflow.scheduler_heartbeat' in message:
           this_iter_time = datetime.now()
           diff = (this_iter_time - prev_iter_time).total_seconds() if prev_iter_time is not None else 0
           print(f"{this_iter_time} [{diff} seconds] received message: {message}.")
           prev_iter_time = this_iter_time
   ```
   
   This is an example output when I set `scheduler_heartbeat_sec` to 5:
   
   ```
   2020-12-11 22:18:34.747872 [0 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:18:40.760118 [6.012246 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:18:46.765760 [6.005642 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:18:52.771790 [6.00603 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:18:58.778018 [6.006228 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:19:04.784721 [6.006703 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:19:10.789910 [6.005189 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   ```
   
   And this is another example when I set `scheduler_heartbeat_sec` to 6:
   
   ```
   2020-12-11 22:19:56.306896 [0 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:20:02.317303 [6.010407 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:20:10.324169 [8.006866 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:20:18.331488 [8.007319 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:20:24.338565 [6.007077 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:20:30.345710 [6.007145 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:20:38.347814 [8.002104 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:20:44.358943 [6.011129 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:20:52.368052 [8.009109 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   ```
   
   Notice that this time it fluctuates between 6 and 8 for reason.
   
   Now, setting `scheduler_heartbeat_sec` to 10, here is a much better stable output with the expected frequency:
   
   ```
   python3 listen_to_statsd.py
   2020-12-11 22:22:39.028670 [0 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:22:49.041901 [10.013231 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:22:59.052456 [10.010555 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:23:09.062201 [10.009745 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:23:19.071650 [10.009449 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:23:29.084219 [10.012569 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:23:39.093269 [10.00905 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:23:49.106148 [10.012879 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:23:59.113249 [10.007101 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:24:09.124079 [10.01083 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:24:19.135419 [10.01134 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:24:29.143558 [10.008139 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:24:39.154385 [10.010827 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 22:24:49.164710 [10.010325 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   ```
   
   **Anything else we need to know**:
   
   Yes, you are awesome, but you might already know this.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on issue #13026: Scheduler heartbeat adds one second if it is is configured to be 5 or less

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #13026:
URL: https://github.com/apache/airflow/issues/13026#issuecomment-743484731


   Could you try this on 2.0.0rc1?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on issue #13026: Scheduler heartbeat adds one second if it is is configured to be 5 or less

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #13026:
URL: https://github.com/apache/airflow/issues/13026#issuecomment-744416184


   `pip install 'apache-airflow[statsd]==2.0.0rc2'` (there's an rc2 out now, soon to be an rc3)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on issue #13026: Scheduler heartbeat adds one second if it is is configured to be 5 or less

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #13026:
URL: https://github.com/apache/airflow/issues/13026#issuecomment-743486354


   This seems  better on 2.0.0rc1:
   
   Testing with 2s (a _bit_ noisy)
   
   ```
   2020-12-11 23:50:15.752318 [2.145703 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 23:50:18.657482 [2.905164 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 23:50:20.763591 [2.106109 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 23:50:23.759480 [2.995889 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 23:50:25.896743 [2.137263 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 23:50:28.718250 [2.821507 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 23:50:30.867545 [2.149295 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 23:50:33.780438 [2.912893 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 23:50:35.899214 [2.118776 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 23:50:38.892609 [2.993395 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 23:50:41.038221 [2.145612 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 23:50:43.891272 [2.853051 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 23:50:46.034377 [2.143105 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   ```
   
   Test with 1s
   ```
   2020-12-11 23:50:54.327609 [1.036089 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 23:50:55.402327 [1.074718 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 23:50:56.480900 [1.078573 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 23:50:58.312181 [1.831281 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 23:50:59.357452 [1.045271 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 23:51:00.435100 [1.077648 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 23:51:01.514030 [1.07893 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 23:51:03.437901 [1.923871 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   2020-12-11 23:51:04.510359 [1.072458 seconds] received message: b'airflow.scheduler_heartbeat:1|c'.
   ```
   
   So possible we can close this -- but you tell me.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] rafidka commented on issue #13026: Scheduler heartbeat adds one second if it is is configured to be 5 or less

Posted by GitBox <gi...@apache.org>.
rafidka commented on issue #13026:
URL: https://github.com/apache/airflow/issues/13026#issuecomment-744151548


   What is the easiest way to install Airflow 2.0.0rc1 and StatsD dependencies? 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] rafidka commented on issue #13026: Scheduler heartbeat adds one second if it is is configured to be 5 or less

Posted by GitBox <gi...@apache.org>.
rafidka commented on issue #13026:
URL: https://github.com/apache/airflow/issues/13026#issuecomment-747639539


   I repeated the experiment on Airflow 2.0.0RC2 and produced more statistical data which you can find in the Excel sheet below, but here is a summary:
   
   scheduler_heartbeat_sec value|Average Frequency of scheduler_heartbeat metric
   -|-
   1|2.52057172
   2|3.85960122
   3|4.76818162
   5|6.49515588
   10|11.36376658
   30|30.89824894
   
   I tried this on the same machine I mentioned above (Amazon r5.4xlarge machine) so it is pretty powerful and I did confirm there isn't much load on the CPU (below 10% which is mainly the use of Airflow). I can retry this on a personal laptop if you feel you don't have strong confidence about results generated from a single machine.
   
   I cannot tell whether this is just a metrics issue or not, but I did look at the code and I do feel it is an actual scheduling issue, not just metrics (though I must admit my understanding of Airflow code base is still limited.) In my opinion, this justifies some investigation to see what is going on. In particular, I would like to suggest:
   
   1. Investigate whether this is just a metrics issue or indeed an issue with the scheduler.
   2. If it is a metric issue, I think it is important to fix. The scheduler_heartbeat is an important metric that can be used to judge the health of the system.
   3. If it is not just a metric issue, then that's probably even more important 😊 
   4. Admittedly, the higher the scheduler_heartbet_sec value, the more accurate the metric is (which suggests this is an actual scheduling issue not just metrics). So if no fix is intended, then at least the default value in airflow.cfg should be updated to, say, 30 seconds. In fact, if no fix is intended, I would even suggest putting a minimum value on `scheduler_heartbat_sec` config, or at least log a warning if the user specify a low value; there isn't much point in allowing the user to specify 5 seconds when they will get an average of 6.5 instead.
   
   I can help with this investigation if you agree with me that it is important to do (though probably won't be able to do so before the new year). Otherwise, feel free to resolve (though I still think at least point 4 above is important if we think that scheduler interval accuracy is not important.)
   
   ## Statistical Data for Different Runs
   
   Below is a snapshot of the Excel sheet I mentioned above. I can upload the Excel sheet file itself if you like, in which case please advise where I should upload it to.
   
   ![image](https://user-images.githubusercontent.com/442447/102529079-44c31c00-4054-11eb-81c5-08a4a17efac1.png)
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] rafidka edited a comment on issue #13026: Scheduler heartbeat adds one second if it is is configured to be 5 or less

Posted by GitBox <gi...@apache.org>.
rafidka edited a comment on issue #13026:
URL: https://github.com/apache/airflow/issues/13026#issuecomment-747639539


   I repeated the experiment on Airflow 2.0.0RC2 and produced more statistical data which you can find in the Excel sheet below, but here is a summary:
   
   scheduler_heartbeat_sec value|Average Frequency of scheduler_heartbeat metric
   -|-
   1|2.52057172
   2|3.85960122
   3|4.76818162
   5|6.49515588
   10|11.36376658
   30|30.89824894
   
   I tried this on the same machine I mentioned above (Amazon r5.4xlarge machine) so it is pretty powerful and I did confirm there isn't much load on the CPU (below 10% which is mainly the use of Airflow). I can retry this on a personal laptop if you feel you don't have strong confidence about results generated from a single machine (which, admittedly, I also feel so.)
   
   I cannot tell whether this is just a metrics issue or not, but I did look at the code and I do feel it is an actual scheduling issue, not just metrics (though I must admit my understanding of Airflow code base is still limited.) In my opinion, this justifies some investigation to see what is going on. In particular, I would like to suggest:
   
   1. Investigate whether this is just a metrics issue or indeed an issue with the scheduler.
   2. If it is a metric issue, I think it is important to fix. The scheduler_heartbeat is an important metric that can be used to judge the health of the system.
   3. If it is not just a metric issue, then that's probably even more important 😊 
   4. Admittedly, the higher the scheduler_heartbet_sec value, the more accurate the metric is (which suggests this is an actual scheduling issue not just metrics). So if no fix is intended, then at least the default value in airflow.cfg should be updated to, say, 30 seconds. In fact, if no fix is intended, I would even suggest putting a minimum value on `scheduler_heartbat_sec` config, or at least log a warning if the user specify a low value; there isn't much point in allowing the user to specify 5 seconds when they will get an average of 6.5 instead.
   
   I can help with this investigation if you agree with me that it is important to do (though probably won't be able to do so before the new year). Otherwise, feel free to resolve (though I still think at least point 4 above is important if we think that scheduler interval accuracy is not important.)
   
   ## Statistical Data for Different Runs
   
   Below is a snapshot of the Excel sheet I mentioned above. I can upload the Excel sheet file itself if you like, in which case please advise where I should upload it to.
   
   ![image](https://user-images.githubusercontent.com/442447/102529079-44c31c00-4054-11eb-81c5-08a4a17efac1.png)
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] rafidka commented on issue #13026: Scheduler heartbeat adds one second if it is is configured to be 5 or less

Posted by GitBox <gi...@apache.org>.
rafidka commented on issue #13026:
URL: https://github.com/apache/airflow/issues/13026#issuecomment-743492762


   It seems much better on Airflow 2.0.0rc1, though -interestingly- with 2s value it is less accurate than with 1s; one would think that 1s should be less accurate because the scheduler loop is more likely to take more than the heartbeat interval. I will give it a try with different values and update the issue with my findings accordingly, but if you feel this is not an issue, feel free to resolve.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on issue #13026: Scheduler heartbeat adds one second if it is is configured to be 5 or less

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #13026:
URL: https://github.com/apache/airflow/issues/13026#issuecomment-743485530


   OH I just saw the nice repro script you gave! Trying this myself.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] rafidka commented on issue #13026: Scheduler heartbeat adds one second if it is is configured to be 5 or less

Posted by GitBox <gi...@apache.org>.
rafidka commented on issue #13026:
URL: https://github.com/apache/airflow/issues/13026#issuecomment-751955610


   I repeated the experiment using Airflow 2.0 on my personal laptop, which is a relatively powerful laptop with Intel 10750H 5.0 GHz 6-core/12-thread, 16 GB RAM. Below are some samples:
   
   scheduler_heartbeat_sec value|Sample Frequency of scheduler_heartbeat metric
   -|-
   1|1.14, 1,55, 1,68
   2|2.64, 2.26, 2.86, 2.24
   3|3.99, 3.44, 3.96, 4.04, 4.02, 3.4
   5|5.13, 5.14, 5.14, 5.17
   10|10.28, 10.27, 10.29, 10.23, 10.27
   30|30.29, 30.29, 30.27, 30.20, 30.29, 30.20
   
   For some reason the 2 and 3 are noticeably bad. The rest are better, but there is still some 0.1~0.3 sec lag.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org