Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/08/28 18:22:06 UTC

[GitHub] [airflow] michalmisiewicz opened a new issue #10636: Kubernetes executors hangs on pod submission

michalmisiewicz opened a new issue #10636:
URL: https://github.com/apache/airflow/issues/10636


   **Apache Airflow version**: 1.10.10, 1.10.11, 1.10.12
   
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`): 1.15.11, 1.17.7
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: AKS
   - **Others**: Python 3.6, Python 3.7, Python 3.8
   
   The Kubernetes executor hangs from time to time on worker pod submission. The pod creation request times out after 15 minutes and the scheduler loop then continues. I have reproduced the problem with several Kubernetes / Python / Airflow version combinations.
   
   py-spy dump:
   ```
   Process 6: /usr/local/bin/python /usr/local/bin/airflow scheduler
   Python v3.7.9 (/usr/local/bin/python3.7)
   
   Thread 6 (idle): "MainThread"
       read (ssl.py:929)
       recv_into (ssl.py:1071)
       readinto (socket.py:589)
       _read_status (http/client.py:271)
       begin (http/client.py:310)
       getresponse (http/client.py:1369)
       _make_request (urllib3/connectionpool.py:421)
       urlopen (urllib3/connectionpool.py:677)
       urlopen (urllib3/poolmanager.py:336)
       request_encode_body (urllib3/request.py:171)
       request (urllib3/request.py:80)
       request (kubernetes/client/rest.py:170)
       POST (kubernetes/client/rest.py:278)
       request (kubernetes/client/api_client.py:388)
       __call_api (kubernetes/client/api_client.py:176)
       call_api (kubernetes/client/api_client.py:345)
       create_namespaced_pod_with_http_info (kubernetes/client/api/core_v1_api.py:6265)
       create_namespaced_pod (kubernetes/client/api/core_v1_api.py:6174)
       run_pod_async (airflow/contrib/kubernetes/pod_launcher.py:81)
       run_next (airflow/contrib/executors/kubernetes_executor.py:486)
       sync (airflow/contrib/executors/kubernetes_executor.py:878)
       heartbeat (airflow/executors/base_executor.py:134)
       _validate_and_run_task_instances (airflow/jobs/scheduler_job.py:1505)
       _execute_helper (airflow/jobs/scheduler_job.py:1443)
       _execute (airflow/jobs/scheduler_job.py:1382)
       run (airflow/jobs/base_job.py:221)
       scheduler (airflow/bin/cli.py:1040)
       wrapper (airflow/utils/cli.py:75)
       <module> (airflow:37)
   ```
   logs:
   ```
    [2020-08-26 18:26:25,721] {base_executor.py:122} DEBUG - 0 running task instances
   [2020-08-26 18:26:25,722] {base_executor.py:123} DEBUG - 1 in queue
   [2020-08-26 18:26:25,722] {base_executor.py:124} DEBUG - 32 open slots
   [2020-08-26 18:26:25,722] {kubernetes_executor.py:840} INFO - Add task ('ProcessingTask', 'exec_spark_notebook', datetime.datetime(2020, 8, 26, 18, 26, 21, 61159, tzinfo=<
   Timezone [UTC]>), 1) with command ['airflow', 'run', 'ProcessingTask', 'exec_spark_notebook', '2020-08-26T18:26:21.061159+00:00', '--local', '--pool', 'default_pool', '-sd
   ', '/usr/local/airflow/dags/qubole_processing.py'] with executor_config {}
   [2020-08-26 18:26:25,723] {base_executor.py:133} DEBUG - Calling the <class 'airflow.contrib.executors.kubernetes_executor.KubernetesExecutor'> sync method
   [2020-08-26 18:26:25,723] {kubernetes_executor.py:848} DEBUG - self.running: {('ProcessingTask', 'exec_spark_notebook', datetime.datetime(2020, 8, 26, 18, 26, 21, 61159, t
   zinfo=<Timezone [UTC]>), 1): ['airflow', 'run', 'ProcessingTask', 'exec_spark_notebook', '2020-08-26T18:26:21.061159+00:00', '--local', '--pool', 'default_pool', '-sd', '/
   usr/local/airflow/dags/qubole_processing.py']}
   [2020-08-26 18:26:25,725] {kubernetes_executor.py:471} INFO - Kubernetes job is (('ProcessingTask', 'exec_spark_notebook', datetime.datetime(2020, 8, 26, 18, 26, 21, 61159
   , tzinfo=<Timezone [UTC]>), 1), ['airflow', 'run', 'ProcessingTask', 'exec_spark_notebook', '2020-08-26T18:26:21.061159+00:00', '--local', '--pool', 'default_pool', '-sd',
    '/usr/local/airflow/dags/qubole_processing.py'], KubernetesExecutorConfig(image=None, image_pull_policy=None, request_memory=None, request_cpu=None, limit_memory=None, limi
   t_cpu=None, limit_gpu=None, gcp_service_account_key=None, node_selectors=None, affinity=None, annotations={}, volumes=[], volume_mounts=[], tolerations=None, labels={}))
   [2020-08-26 18:26:25,725] {kubernetes_executor.py:474} DEBUG - Kubernetes running for command ['airflow', 'run', 'ProcessingTask', 'exec_spark_notebook', '2020-08-26T18:26
   :21.061159+00:00', '--local', '--pool', 'default_pool', '-sd', '/usr/local/airflow/dags/qubole_processing.py']
   [2020-08-26 18:26:25,726] {kubernetes_executor.py:475} DEBUG - Kubernetes launching image dpadevairflowacr01.azurecr.io/airflow:1.10.10-20200826-v2
   [2020-08-26 18:26:25,729] {pod_launcher.py:79} DEBUG - Pod Creation Request:
   {{ POD JSON }}
   [2020-08-26 18:26:26,003] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5546)
   [2020-08-26 18:26:26,612] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor779-Process, stopped)>
   [2020-08-26 18:26:28,148] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5553)
   [2020-08-26 18:26:28,628] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor780-Process, stopped)>
   [2020-08-26 18:26:31,473] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5560)
   [2020-08-26 18:26:32,005] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor781-Process, stopped)>
   [2020-08-26 18:26:32,441] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5567)
   [2020-08-26 18:26:33,017] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor782-Process, stopped)>
   [2020-08-26 18:26:37,501] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5578)
   [2020-08-26 18:26:38,044] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor784-Process, stopped)>
   [2020-08-26 18:26:38,510] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5585)
   [2020-08-26 18:26:39,054] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor785-Process, stopped)>
   [2020-08-26 18:26:39,481] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5574)
   [2020-08-26 18:26:40,057] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor783-Process, stopped)>
   [2020-08-26 18:26:44,549] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5599)
   [2020-08-26 18:26:45,108] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor787-Process, stopped)>
   [2020-08-26 18:26:45,613] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5606)
   [2020-08-26 18:26:46,118] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor788-Process, stopped)>
   [2020-08-26 18:26:48,742] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5595)
   [2020-08-26 18:26:49,127] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor786-Process, stopped)>
   [2020-08-26 18:26:50,596] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5616)
   [2020-08-26 18:26:51,151] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor789-Process, stopped)>
   [2020-08-26 18:26:51,653] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5623)
   [2020-08-26 18:26:52,161] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor790-Process, stopped)>
   [2020-08-26 18:26:54,664] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5630)
   [2020-08-26 18:26:55,179] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor791-Process, stopped)>
   [2020-08-26 18:26:56,645] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5637)
   [2020-08-26 18:26:57,194] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor792-Process, stopped)>
   [2020-08-26 18:26:57,592] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5644)
   [2020-08-26 18:26:58,207] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor793-Process, stopped)>
   [2020-08-26 18:27:00,809] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5651)
   [2020-08-26 18:27:01,599] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor794-Process, stopped)>
   [2020-08-26 18:27:03,124] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5658)
   [2020-08-26 18:27:03,615] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor795-Process, stopped)>
   [2020-08-26 18:27:04,120] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5665)
   [2020-08-26 18:27:04,627] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor796-Process, stopped)>
   [2020-08-26 18:27:07,167] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5672)
   [2020-08-26 18:27:07,642] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor797-Process, stopped)>
   [2020-08-26 18:27:09,125] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5679)
   [2020-08-26 18:27:09,654] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor798-Process, stopped)>
   [2020-08-26 18:27:10,201] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5686)
   [2020-08-26 18:27:10,664] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor799-Process, stopped)>
   [2020-08-26 18:27:15,149] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5697)
   [2020-08-26 18:27:15,706] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor801-Process, stopped)>
   [2020-08-26 18:27:16,128] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5704)
   [2020-08-26 18:27:16,717] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor802-Process, stopped)>
   [2020-08-26 18:27:18,167] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5693)
   [2020-08-26 18:27:18,725] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor800-Process, stopped)>
   [2020-08-26 18:27:21,200] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5714)
   [2020-08-26 18:27:21,751] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor803-Process, stopped)>
   [2020-08-26 18:27:22,221] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5721)
   [2020-08-26 18:27:22,760] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor804-Process, stopped)>
   [2020-08-26 18:27:24,192] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5728)
   [2020-08-26 18:27:24,773] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor805-Process, stopped)>
   [2020-08-26 18:27:27,249] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5735)
   [2020-08-26 18:27:27,787] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor806-Process, stopped)>
   [2020-08-26 18:27:28,246] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5742)
   [2020-08-26 18:27:28,798] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor807-Process, stopped)>
   [2020-08-26 18:27:30,318] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5749)
   [2020-08-26 18:27:30,810] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor808-Process, stopped)>
   [2020-08-26 18:27:33,747] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5756)
   [2020-08-26 18:27:34,260] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor809-Process, stopped)>
   [2020-08-26 18:27:34,670] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5763)
   [2020-08-26 18:27:35,271] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor810-Process, stopped)>
   [2020-08-26 18:27:36,765] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5770)
   [2020-08-26 18:27:37,286] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor811-Process, stopped)>
   [2020-08-26 18:27:39,802] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5777)
   [2020-08-26 18:27:40,304] {scheduler_job.py:268} DEBUG - Waiting for <Process(DagFileProcessor812-Process, stopped)>
   [2020-08-26 18:27:45,820] {settings.py:278} DEBUG - Disposing DB connection pool (PID 5784)
   [2020-08-26 18:42:04,209] {kubernetes_executor.py:885} WARNING - HTTPError when attempting to run task, re-queueing. Exception: HTTPSConnectionPool(host='10.2.0.1', port=443
   ): Read timed out. (read timeout=None)
   [2020-08-26 18:42:04,211] {scheduler_job.py:1450} DEBUG - Heartbeating the scheduler
   [2020-08-26 18:42:04,352] {base_job.py:200} DEBUG - [heartbeat]
   [2020-08-26 18:42:04,353] {scheduler_job.py:1459} DEBUG - Ran scheduling loop in 938.71 seconds
   [2020-08-26 18:42:04,353] {scheduler_job.py:1462} DEBUG - Sleeping for 1.00 seconds
   [2020-08-26 18:42:05,354] {scheduler_job.py:1425} DEBUG - Starting Loop...
   [2020-08-26 18:42:05,355] {scheduler_job.py:1436} DEBUG - Harvesting DAG parsing results
   [2020-08-26 18:42:05,355] {dag_processing.py:648} DEBUG - Received message of type SimpleDag
   [2020-08-26 18:42:05,355] {dag_processing.py:648} DEBUG - Received message of type DagParsingStat
   [2020-08-26 18:42:05,356] {dag_processing.py:648} DEBUG - Received message of type DagParsingStat
   ```
   
   How can I configure Airflow to emit logs from imported packages? I would like to check the `urllib3` and `http.client` logs in order to understand the problem. The Airflow scheduler log shows only logs from the Airflow codebase.
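To illustrate the question above: a minimal sketch of turning on third-party logging in a plain Python process. This is generic `logging` / `http.client` usage, not Airflow-specific configuration; Airflow's own logging setup may install its own handlers and override the root logger, so in a real deployment the logger levels would need to be set through Airflow's logging config instead.

```python
import http.client
import logging

# Route all loggers, including those of imported packages such as urllib3,
# through the root handler at DEBUG level.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)

# http.client bypasses the logging module entirely: it prints its wire-level
# trace to stdout only when debuglevel is raised above 0.
http.client.HTTPConnection.debuglevel = 1
```

With both knobs set, every request/response line from the Kubernetes client's HTTP layer becomes visible, which would show exactly where the 15-minute stall occurs.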
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] michalmisiewicz commented on issue #10636: Kubernetes executors hangs on pod submission

Posted by GitBox <gi...@apache.org>.
michalmisiewicz commented on issue #10636:
URL: https://github.com/apache/airflow/issues/10636#issuecomment-706587853


   @dimberman Requests were hanging due to the idle-connection timeout on Azure Firewall. The timeout is set to [4 minutes](https://docs.microsoft.com/en-us/azure/firewall/firewall-faq#what-is-the-tcp-idle-timeout-for-azure-firewall). Under the hood, the Kubernetes library uses a urllib3 `PoolManager`, which reuses connections. After the timeout an RST packet is sent, which causes urllib3 to hang indefinitely. I'm going to provide a fix for it.
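The standard remedy for the failure mode described above is TCP keepalive: probe an idle connection often enough that the firewall's 4-minute idle counter never expires. A minimal sketch using only the stdlib socket API (the `TCP_KEEP*` constants are platform-specific, hence the `hasattr` guards; the chosen intervals are illustrative, not values from this thread):

```python
import socket

def enable_tcp_keepalive(sock, idle=120, interval=30, count=4):
    """Enable TCP keepalive so an idle connection is probed before a
    middlebox (e.g. a firewall with a 4-minute idle timeout) drops it."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Platform-specific tuning knobs; not available on every OS.
    if hasattr(socket, "TCP_KEEPIDLE"):    # seconds idle before first probe
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):   # seconds between probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):     # failed probes before giving up
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock
```

In a urllib3-based client these options can typically be applied to every pooled connection via the connection class's default socket options, so reused connections stay alive between scheduler heartbeats.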





[GitHub] [airflow] dimberman commented on issue #10636: Kubernetes executors hangs on pod submission

Posted by GitBox <gi...@apache.org>.
dimberman commented on issue #10636:
URL: https://github.com/apache/airflow/issues/10636#issuecomment-692876443


   @michalmisiewicz Are you sure that this is an Airflow problem and not a Kubernetes problem? It looks like there's a timeout while pinging the Kube API (which I believe is common, as we had to build workarounds to keep timeouts from crashing the JobWatcher).
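Note that the scheduler log above reports `Read timed out. (read timeout=None)`: the client had no read deadline of its own, so the hang lasted as long as the OS kept retransmitting. A small stdlib illustration of the difference (generic socket behaviour, not the Kubernetes client's actual configuration):

```python
import socket

# A socket in blocking mode has no client-side read deadline: recv() on a
# connection that a firewall silently dropped can block for many minutes.
s = socket.socket()
assert s.gettimeout() is None  # default: fully blocking, no bound

# An explicit timeout bounds every blocking call on the socket, so a dead
# connection raises socket.timeout instead of stalling the caller.
s.settimeout(60.0)
assert s.gettimeout() == 60.0
s.close()
```

This is why a client-side request timeout (or the keepalive fix discussed in this thread) bounds the stall to seconds rather than the 15-minute TCP retransmission window seen in the log.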





[GitHub] [airflow] dimberman closed issue #10636: Kubernetes executors hangs on pod submission

Posted by GitBox <gi...@apache.org>.
dimberman closed issue #10636:
URL: https://github.com/apache/airflow/issues/10636


   

