Posted to commits@airflow.apache.org by "Gaurav Sehgal (Jira)" <ji...@apache.org> on 2019/12/15 19:00:00 UTC

[jira] [Commented] (AIRFLOW-4424) Scheduler does not terminate after num_runs when executor is KubernetesExecutor

    [ https://issues.apache.org/jira/browse/AIRFLOW-4424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996849#comment-16996849 ] 

Gaurav Sehgal commented on AIRFLOW-4424:
----------------------------------------

Hi, at GoJek we are facing the same issue with the LocalExecutor. Here's the thread dump.

```
# ThreadID: 140356901611264
File: "/usr/local/lib/python3.7/threading.py", line 890, in _bootstrap
  self._bootstrap_inner()
File: "/usr/local/lib/python3.7/threading.py", line 926, in _bootstrap_inner
  self.run()
File: "<string>", line 167, in run
File: "/usr/local/lib/python3.7/code.py", line 232, in interact
  more = self.push(line)
File: "/usr/local/lib/python3.7/code.py", line 258, in push
  more = self.runsource(source, self.filename)
File: "/usr/local/lib/python3.7/code.py", line 74, in runsource
  self.runcode(code)
File: "/usr/local/lib/python3.7/code.py", line 90, in runcode
  exec(code, self.locals)
File: "<console>", line 3, in <module>

# ThreadID: 140358376056576
File: "/usr/local/bin/airflow", line 37, in <module>
  args.func(args)
File: "/usr/local/lib/python3.7/site-packages/airflow/utils/cli.py", line 74, in wrapper
  return f(*args, **kwargs)
File: "/usr/local/lib/python3.7/site-packages/airflow/bin/cli.py", line 1042, in scheduler
  job.run()
File: "/usr/local/lib/python3.7/site-packages/airflow/jobs/base_job.py", line 222, in run
  self._execute()
File: "/usr/local/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 1356, in _execute
  self._execute_helper()
File: "/usr/local/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py", line 1496, in _execute_helper
  self.executor.end()
File: "/usr/local/lib/python3.7/site-packages/airflow/executors/local_executor.py", line 233, in end
  self.impl.end()
File: "/usr/local/lib/python3.7/site-packages/airflow/executors/local_executor.py", line 212, in end
  self.queue.join()
File: "<string>", line 2, in join
File: "/usr/local/lib/python3.7/multiprocessing/managers.py", line 819, in _callmethod
  kind, result = conn.recv()
File: "/usr/local/lib/python3.7/multiprocessing/connection.py", line 250, in recv
  buf = self._recv_bytes()
File: "/usr/local/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
  buf = self._recv(4)
File: "/usr/local/lib/python3.7/multiprocessing/connection.py", line 379, in _recv
  chunk = read(handle, remaining)

```
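
The second thread in the dump is parked in `self.queue.join()`, reached through a manager proxy. As far as I know, a manager `Queue()` proxies a plain `queue.Queue`, whose `join()` only returns once every `put()` has been matched by a `task_done()`. The sketch below is not Airflow code; it only illustrates that behaviour under the assumption that a worker took an item and exited without acknowledging it. The helper `join_returns_within` is a hypothetical name used for this illustration.

```python
import queue
import threading


def join_returns_within(q, timeout):
    """Return True if q.join() completes within `timeout` seconds."""
    done = threading.Event()

    def waiter():
        q.join()  # blocks until every put() is matched by a task_done()
        done.set()

    # Daemon thread, so a join() that never returns cannot keep the process alive.
    threading.Thread(target=waiter, daemon=True).start()
    return done.wait(timeout)


q = queue.Queue()
for item in ("task-a", "task-b"):
    q.put(item)

# Assumed scenario: a worker takes both items but acknowledges only the first.
q.get()
q.task_done()
q.get()  # the worker "dies" here, before calling task_done()

# One acknowledgement is missing, so join() is still blocked after the timeout.
stuck = not join_returns_within(q, 0.2)

# Supplying the missing acknowledgement lets join() return.
q.task_done()
recovered = join_returns_within(q, 1.0)

print("join() blocked while a task was unacknowledged:", stuck)
print("join() returned after the final task_done():", recovered)
```

If something similar is happening here, checking whether the worker processes are still alive at the time of `end()` might help narrow it down.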

> Scheduler does not terminate after num_runs when executor is KubernetesExecutor
> -------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-4424
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4424
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: executors, scheduler
>    Affects Versions: 1.10.3
>         Environment: EKS, deployed with stable airflow helm chart
>            Reporter: Brian Nutt
>            Priority: Blocker
>              Labels: kubernetes
>             Fix For: 2.0.0
>
>
> When using an executor like the CeleryExecutor with num_runs set on the scheduler, the scheduler pod restarts after num_runs runs have completed. After switching to KubernetesExecutor, the scheduler logs:
> [2019-04-26 19:20:43,562] {kubernetes_executor.py:770} INFO - Shutting down Kubernetes executor
> However, the scheduler process does not exit, so the scheduler pod never restarts to run num_runs again. We had to roll back to CeleryExecutor because, with num_runs set to -1, the scheduler builds up many defunct processes, which eventually prevents tasks from being scheduled once the underlying nodes run out of file descriptors.
>  


