Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/03/05 23:50:12 UTC

[GitHub] [airflow] arkadiuszbach commented on issue #10860: Timeouts in Airflow scheduler logs on AKS

arkadiuszbach commented on issue #10860:
URL: https://github.com/apache/airflow/issues/10860#issuecomment-791796674


   On Azure, LoadBalancers have a TCP idle timeout that defaults to 4 minutes (it is visible in the JSON view), so if you connect and don't interact with the connection for more than 4 minutes, the LoadBalancer drops it.
   
   Airflow uses a Kubernetes watcher in stream mode to monitor pod events.
   It connects and waits for events; if there are no events for more than 4 minutes, the LoadBalancer drops the connection, but the watcher is still listening for events and doesn't even know that the connection was dropped.
   
   When you add:
   ```
   - name: AIRFLOW__KUBERNETES__KUBE_CLIENT_REQUEST_ARGS
     value: '{"_request_timeout" : [60, 60]}'
   ```
   Then, if the watcher does not get any events for more than 60 seconds, a Read Timeout happens, but this time the disconnect is on the client side (Airflow). There is a `while(true)` in the code, so it will connect again, and that is what you see in the logs when it says "and now my watch begins".
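   The loop described above can be sketched as a simplified, self-contained simulation (this is not Airflow's actual watcher code: `watch_pod_events` is a stand-in for the Kubernetes watch stream, and the built-in `TimeoutError` stands in for urllib3's `ReadTimeoutError`):

```python
def watch_pod_events(request_timeout):
    """Stand-in for the Kubernetes watch stream: yields a few events,
    then times out as if _request_timeout's read timeout expired."""
    yield "ADDED pod-1"
    yield "MODIFIED pod-1"
    raise TimeoutError("Read timed out. (read timeout=%s)" % request_timeout[1])

def run_watcher(max_restarts=3):
    """Simplified scheduler watch loop: on a read timeout it simply
    reconnects, which is the while(true) behaviour described above."""
    seen, restarts = [], 0
    while True:  # "and now my watch begins"
        try:
            for event in watch_pod_events(request_timeout=(60, 60)):
                seen.append(event)
        except TimeoutError:
            restarts += 1
            if restarts >= max_restarts:
                break  # exit for the demo; the real loop runs forever
    return seen, restarts
```

   With `_request_timeout` set, each timeout only produces a log entry and a reconnect, so the scheduler keeps running; without it, a silently dropped connection would leave the watcher hanging forever.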
   
   I tried the solution with `_request_timeout` and it works, but I didn't like these errors in the logs, so I looked around and found the following, which is pretty much the same situation (it also involves LoadBalancers) and describes it in more detail: https://www.finbourne.com/blog/the-mysterious-hanging-client-tcp-keep-alives
   
   So the solution is to add TCP keep-alive: it will probe the LoadBalancer, and the idle timeout will not be triggered. Even if the LoadBalancer disconnects for some reason, the keep-alive will probe it, and if it does not respond (for example, 3 times over 60 seconds) the client will simply disconnect and connect again.
   
   More information about keepalive:
   https://stackoverflow.com/questions/1480236/does-a-tcp-socket-connection-have-a-keep-alive
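   At the socket level, the probe behaviour described above maps to standard socket options (a minimal sketch; `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` are Linux constants and are not available on every platform, hence the `hasattr` guard):

```python
import socket

# Create a TCP socket and enable keep-alive on it.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific tuning: seconds of idle time before the first probe,
# seconds between probes, and how many failed probes drop the connection.
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)

# Reading the value back confirms keep-alive is enabled.
keepalive_enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
sock.close()
```

   With these values, the kernel sends a probe after 60 seconds of inactivity, which is well under Azure's 4-minute idle timeout.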
   
   With the help of the solution above, https://github.com/maganaluis/k8s-api-python, I was able to make it work.
   So I just downloaded the Airflow version I had from PyPI (Airflow 1.10.14), took the `airflow` file from `airflow/bin`, and after
   `if __name__ == '__main__':` added:
   ```
       import socket
       from urllib3 import connection
       connection.HTTPConnection.default_socket_options += [
           (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
           (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),
           (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60),
           (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)
       ]
   ``` 
   renamed this file to `airflow_custom_start.py` and added it to the `AIRFLOW_HOME` directory inside my Airflow Docker image. Then, in `entrypoint.sh`, I started the scheduler not with the usual
    `airflow scheduler` command,
   but with:
    `python $AIRFLOW_HOME/airflow_custom_start.py`
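   The effect of that monkey-patch can be checked in isolation: after extending `default_socket_options`, every new `HTTPConnection` that urllib3 creates (including the ones inside the Kubernetes client) will apply the keep-alive options to its socket. A standalone sketch, assuming urllib3 is installed; it does not start Airflow:

```python
import socket
from urllib3 import connection

# Same patch as in airflow_custom_start.py: every HTTPConnection created
# from now on will set these options on its underlying socket.
keepalive_opts = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
if hasattr(socket, "TCP_KEEPIDLE"):  # Linux-only constants
    keepalive_opts += [
        (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),
        (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60),
        (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3),
    ]
connection.HTTPConnection.default_socket_options = (
    connection.HTTPConnection.default_socket_options + keepalive_opts
)

# The keep-alive options are now part of the defaults for new connections.
patched = connection.HTTPConnection.default_socket_options
```

   Because it is a class-level default, the patch only needs to run once, before any connections are created, which is why it goes right after `if __name__ == '__main__':` in the launcher script.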
   
   Also remember to remove `_request_timeout`, otherwise it will keep disconnecting.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org