Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/12/13 07:49:52 UTC

[GitHub] [airflow] nguyenmphu opened a new issue, #28328: Scheduler pod hang when K8s API call fail

nguyenmphu opened a new issue, #28328:
URL: https://github.com/apache/airflow/issues/28328

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### What happened
   
   Airflow version: `2.3.4`
   
   I deployed Airflow on K8s using the official Helm chart with `KubernetesExecutor`. Sometimes the scheduler hangs when calling the K8s API. The log:
   ``` bash
   ERROR - Exception when executing Executor.end
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 752, in _execute
       self._run_scheduler_loop()
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 842, in _run_scheduler_loop
       self.executor.heartbeat()
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/base_executor.py", line 171, in heartbeat
       self.sync()
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 649, in sync
       next_event = self.event_scheduler.run(blocking=False)
     File "/usr/local/lib/python3.8/sched.py", line 151, in run
       action(*argument, **kwargs)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/utils/event_scheduler.py", line 36, in repeat
       action(*args, **kwargs)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 673, in _check_worker_pods_pending_timeout
       for pod in pending_pods().items:
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 15697, in list_namespaced_pod
       return self.list_namespaced_pod_with_http_info(namespace, **kwargs)  # noqa: E501
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 15812, in list_namespaced_pod_with_http_info
       return self.api_client.call_api(
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
       return self.__call_api(resource_path, method,
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
       response_data = self.request(
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 373, in request
       return self.rest_client.GET(url,
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 240, in GET
       return self.request("GET", url,
     File "/home/airflow/.local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 213, in request
       r = self.pool_manager.request(method, url,
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/request.py", line 74, in request
       return self.request_encode_url(
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/request.py", line 96, in request_encode_url
       return self.urlopen(method, url, **extra_kw)
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/poolmanager.py", line 376, in urlopen
       response = conn.urlopen(method, u.request_uri, **kw)
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 815, in urlopen
       return self.urlopen(
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
       httplib_response = self._make_request(
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 386, in _make_request
       self._validate_conn(conn)
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
       conn.connect()
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connection.py", line 358, in connect
       self.sock = conn = self._new_conn()
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/connection.py", line 174, in _new_conn
       conn = connection.create_connection(
     File "/home/airflow/.local/lib/python3.8/site-packages/urllib3/util/connection.py", line 85, in create_connection
       sock.connect(sa)
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 182, in _exit_gracefully
       sys.exit(os.EX_OK)
   SystemExit: 0
   During handling of the above exception, another exception occurred:
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 773, in _execute
       self.executor.end()
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 823, in end
       self._flush_task_queue()
     File "/home/airflow/.local/lib/python3.8/site-packages/airflow/executors/kubernetes_executor.py", line 776, in _flush_task_queue
       self.log.debug('Executor shutting down, task_queue approximate size=%d', self.task_queue.qsize())
     File "<string>", line 2, in qsize
     File "/usr/local/lib/python3.8/multiprocessing/managers.py", line 835, in _callmethod
       kind, result = conn.recv()
     File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 250, in recv
       buf = self._recv_bytes()
     File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
       buf = self._recv(4)
     File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 379, in _recv
       chunk = read(handle, remaining)
   ConnectionResetError: [Errno 104] Connection reset by peer
   ```
   After that, the executor process was killed, but the pod kept running and the scheduler stopped working.
   
   After a restart, the scheduler worked normally.
   
   ### What you think should happen instead
   
   When the error occurs, the executor should restart automatically, or the scheduler should be killed.
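
The "scheduler should be killed" behaviour is often handled outside the scheduler, for example by a Kubernetes exec liveness probe: the kubelet restarts the container when the probe command exits non-zero. The following is only a minimal sketch of the staleness check such a probe could run. The names `heartbeat_is_stale` and `probe_exit_code` are hypothetical and are not part of Airflow or the Helm chart, and how the last heartbeat timestamp is obtained is left as an assumption.

```python
import time
from typing import Optional


def heartbeat_is_stale(
    last_heartbeat: float,
    threshold_seconds: float,
    now: Optional[float] = None,
) -> bool:
    """Return True if the last heartbeat is older than the threshold.

    `last_heartbeat` and `now` are timestamps in seconds on the same clock.
    Where the heartbeat comes from (database row, file mtime, ...) is an
    assumption left to the caller.
    """
    current = time.time() if now is None else now
    return (current - last_heartbeat) > threshold_seconds


def probe_exit_code(
    last_heartbeat: float,
    threshold_seconds: float,
    now: Optional[float] = None,
) -> int:
    """Map the staleness check to a process exit code (0 = healthy, 1 = unhealthy)."""
    return 1 if heartbeat_is_stale(last_heartbeat, threshold_seconds, now) else 0
```

A probe script would end with something like `sys.exit(probe_exit_code(...))`, so that a scheduler that is still running but no longer heartbeating gets restarted.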
   
   ### How to reproduce
   
   _No response_
   
   ### Operating System
   
   Debian GNU/Linux 11 (bullseye)
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   _No response_
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] potiuk commented on issue #28328: Scheduler pod hang when K8s API call fail

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #28328:
URL: https://github.com/apache/airflow/issues/28328#issuecomment-1367867640

   I think no restart is needed. Judging from the stack trace, this error seems to be raised when everything has already finished and the connection is simply reset under a thread that is still reading from it, so it should simply be ignored.
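
One way to read "simply ignored" is to guard the `qsize()` call that fails in the second traceback, so that a dead manager connection during shutdown is logged instead of raised. This is only a sketch of that idea, not the fix that was applied in Airflow. `safe_qsize` is a hypothetical helper, and catching `ConnectionError` (the parent class of `ConnectionResetError`) together with `EOFError` is an assumption about which failures are safe to swallow at this point.

```python
import logging
from typing import Any, Optional

log = logging.getLogger("executor_shutdown_sketch")


def safe_qsize(queue: Any) -> Optional[int]:
    """Return queue.qsize(), or None if the queue's backing process is gone.

    ConnectionResetError is a subclass of ConnectionError, so the error from
    the traceback is covered. EOFError is included as an assumption, since a
    closed pipe can also surface that way.
    """
    try:
        return queue.qsize()
    except (ConnectionError, EOFError) as exc:
        # Shutdown is already in progress, so the size is only informational.
        log.warning("Could not read task queue size during shutdown: %r", exc)
        return None
```

With such a helper, the debug message in `_flush_task_queue` could log the returned value (possibly `None`) and let the rest of the shutdown continue.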




[GitHub] [airflow] potiuk closed issue #28328: Scheduler pod hang when K8s API call fail

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #28328: Scheduler pod hang when K8s API call fail
URL: https://github.com/apache/airflow/issues/28328




[GitHub] [airflow] boring-cyborg[bot] commented on issue #28328: Scheduler pod hang when K8s API call fail

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #28328:
URL: https://github.com/apache/airflow/issues/28328#issuecomment-1347880144

   Thanks for opening your first issue here! Be sure to follow the issue template!
   




[GitHub] [airflow] maxnathaniel commented on issue #28328: Scheduler pod hang when K8s API call fail

Posted by GitBox <gi...@apache.org>.
maxnathaniel commented on issue #28328:
URL: https://github.com/apache/airflow/issues/28328#issuecomment-1367824018

   @potiuk I would like to take on this `good first issue`. How should the error be handled in this scenario? Restart the executor? Presumably, the closed connection refers to the call via `self.kube_client.list_namespaced_pod`?
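
If the failing call were the K8s API request itself, one common option is to retry it a bounded number of times instead of restarting anything. The sketch below shows that pattern in general terms. `call_with_retries` is a hypothetical helper, not an Airflow or Kubernetes client API, and retrying on `OSError` (which covers connection errors and socket timeouts) is an assumption. Note that the first traceback in the report ends in `SystemExit: 0` raised from `_exit_gracefully`, which a wrapper like this deliberately does not catch.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on the given exception types with a linear backoff.

    The last failure is re-raised so the caller still sees the real error.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise
            # Wait a little longer after each failed attempt.
            sleep(delay_seconds * attempt)
    raise AssertionError("unreachable")
```

In the executor, `fn` would be a zero-argument callable wrapping the pod listing, for example a `functools.partial` around `self.kube_client.list_namespaced_pod` with its namespace argument.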




[GitHub] [airflow] potiuk commented on issue #28328: Scheduler pod hang when K8s API call fail

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #28328:
URL: https://github.com/apache/airflow/issues/28328#issuecomment-1366189875

   Looks like error handling while reading output on a closed connection could be done better.

