Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/07/07 09:17:00 UTC

[GitHub] [airflow] V0lantis commented on issue #10790: Copy of [AIRFLOW-5071] JIRA: Thousands of Executor reports task instance X finished (success) although the task says its queued. Was the task killed externally?

V0lantis commented on issue #10790:
URL: https://github.com/apache/airflow/issues/10790#issuecomment-1177296853

   Hello here :wave:, 
   I would very much like to help, as we are experiencing the exact same issue with sensors in our Kubernetes cluster with the Celery executor. Just to add our own experience:
   
   We are running this architecture with a shared NFS that hosts our DAGs and logs (among other things). The speed and throughput allowed on that shared file system are a huge bottleneck for the scheduler (since it needs to parse the DAGs quite often). We noticed that the issue with the sensors and the log message `INFO - Task is not able to be run` appeared when we had used up all the credits AWS allowed us for the file system. We therefore switched to a faster file system (FSx), for those who are wondering, and the issue has almost disappeared, though it is still there (which is why I would very much like to help).
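   
   (To give an idea of that parsing cost, a rough, untested sketch like the one below can time how long filling a `DagBag` from the shared mount takes; `/opt/airflow/dags` is simply the dags folder that shows up in our worker logs.)
   
   ```python
   # Rough sketch (untested): time how long parsing the DAGs from the shared mount takes.
   import time

   from airflow.models import DagBag

   start = time.perf_counter()
   dag_bag = DagBag(dag_folder="/opt/airflow/dags", include_examples=False)
   elapsed = time.perf_counter() - start

   print(f"Parsed {len(dag_bag.dags)} DAGs in {elapsed:.1f}s, "
         f"{len(dag_bag.import_errors)} import errors")
   ```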
   
   But we are also experiencing a similar issue, though I am not exactly sure it is the same one.
   For example, sometimes a task will be picked up by a Celery worker but will not produce any log file. Here is what I found:
   
   1. A task stays in the running state for a while without anything happening.
   2. When we take a look at its log, we see the following:
   
   ```log
   *** Reading remote log from s3://my.bucket/airflow/prod/my_dag/one_of_dag_task/2022-07-03T12:00:00+00:00/1.log.
   [2022-07-04, 20:58:08 UTC] {taskinstance.py:1037} INFO - Dependencies not met for <TaskInstance: my_dag.one_of_dag_task scheduled__2022-07-03T12:00:00+00:00 [running]>, dependency 'Task Instance Not Running' FAILED: Task is in the running state
   [2022-07-04, 20:58:08 UTC] {taskinstance.py:1037} INFO - Dependencies not met for <TaskInstance: my_dag.one_of_dag_task scheduled__2022-07-03T12:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state.
   [2022-07-04, 20:58:08 UTC] {local_task_job.py:99} INFO - Task is not able to be run
   
   [2022-07-04, 20:58:08 UTC] {taskinstance.py:1037} INFO - Dependencies not met for <TaskInstance: my_dag.one_of_dag_task scheduled__2022-07-03T12:00:00+00:00 [running]>, dependency 'Task Instance Not Running' FAILED: Task is in the running state
   [2022-07-04, 20:58:08 UTC] {taskinstance.py:1037} INFO - Dependencies not met for <TaskInstance: my_dag.one_of_dag_task scheduled__2022-07-03T12:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state.
   [2022-07-04, 20:58:08 UTC] {local_task_job.py:99} INFO - Task is not able to be run
   [2022-07-04, 23:21:07 UTC] {spark.py:201} INFO - Driver application_1655985151726_32839 has finished
   [2022-07-04, 23:21:07 UTC] {taskinstance.py:1288} INFO - Marking task as SUCCESS. dag_id=my_dag, task_id=one_of_dag_task, execution_date=20220703T120000, start_date=20220704T145813, end_date=20220704T232107
   [2022-07-04, 23:21:07 UTC] {local_task_job.py:154} INFO - Task exited with return code 0
   [2022-07-04, 23:21:08 UTC] {local_task_job.py:264} INFO - 1 downstream tasks scheduled from follow-on schedule check
   ```
   
   One could ask why I am making such a fuss if the task eventually ran successfully. Because sometimes the task simply never finishes and stays in the _running_ state indefinitely, which is hard to spot (until some alert fires because the DAG is badly delayed).
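   
   (Side note: to spot those stuck tasks a bit earlier, something like the rough sketch below, run against the metadata database, could list task instances that have been `running` for too long; the six-hour threshold is arbitrary and this is only an illustration, not a recommendation.)
   
   ```python
   # Rough sketch (untested): list task instances that have been "running" for a long time.
   # The 6-hour threshold is arbitrary; adjust it to what "stuck" means for your DAGs.
   from datetime import timedelta

   from airflow.models import TaskInstance
   from airflow.utils import timezone
   from airflow.utils.session import provide_session
   from airflow.utils.state import State


   @provide_session
   def find_stuck_running_tasks(max_age=timedelta(hours=6), session=None):
       cutoff = timezone.utcnow() - max_age
       return (
           session.query(TaskInstance)
           .filter(TaskInstance.state == State.RUNNING)
           .filter(TaskInstance.start_date < cutoff)
           .all()
       )


   for ti in find_stuck_running_tasks():
       print(ti.dag_id, ti.task_id, ti.run_id, ti.start_date)
   ```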
   
   3. By digging through the logs, I was able to find the worker that picked up the task at **2022-07-04 14:58:05,058**:
   
   ```log
   prod worker [2022-07-04 14:58:05,058: INFO/MainProcess] Task airflow.executors.celery_executor.execute_command[661f0a53-129b-4856-8db0-5632ff466833] received
   prod worker [2022-07-04 14:58:05,069: INFO/ForkPoolWorker-828] Executing command in Celery: ['airflow', 'tasks', 'run', 'my_dag', 'one_of_dag_task', 'scheduled__2022-07-03T12:00:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/my_dag/one_of_dag_task.py']
   prod worker [2022-07-04 14:58:05,929: WARNING/ForkPoolWorker-828] Running <TaskInstance: my_dag.one_of_dag_task scheduled__2022-07-03T12:00:00+00:00 [queued]> on host airflow-worker-cfd5b7655-7hlcz
   ```
   
   And then there is no further log from the worker saying, for example, that the task went from the `queued` to the `running` state.
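   
   To see what Celery itself thinks happened to that task, one could look up the Celery task ID from the worker log in the result backend, roughly like this (an untested debugging sketch, assuming the stock `CeleryExecutor` Celery app and access to the same broker/result backend as the workers):
   
   ```python
   # Rough sketch (untested): ask the Celery result backend about the task ID
   # that appears in the worker logs above.
   from celery.result import AsyncResult

   from airflow.executors.celery_executor import app

   celery_task_id = "661f0a53-129b-4856-8db0-5632ff466833"  # taken from the worker log
   result = AsyncResult(celery_task_id, app=app)
   print(result.state)   # e.g. PENDING / STARTED / SUCCESS / FAILURE
   print(result.result)  # the exception if it failed, otherwise the return value (None here)
   ```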
   
   4. But then, later:
   
   ```log
   prod worker [2022-07-04 20:58:05,273: INFO/MainProcess] Task airflow.executors.celery_executor.execute_command[661f0a53-129b-4856-8db0-5632ff466833] received
   prod worker [2022-07-04 20:58:05,282: INFO/ForkPoolWorker-2634] Executing command in Celery: ['airflow', 'tasks', 'run', 'my_dag', 'one_of_dag_task', 'scheduled__2022-07-03T12:00:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/my_dag/one_of_dag_task.py']
   prod worker [2022-07-04 20:58:05,282: INFO/ForkPoolWorker-2634] Celery task ID: 661f0a53-129b-4856-8db0-5632ff466833
   prod worker [2022-07-04 20:58:05,306: INFO/ForkPoolWorker-2634] Filling up the DagBag from /opt/airflow/dags/my_dag/one_of_dag_task.py
   prod worker [2022-07-04 20:58:06,023: WARNING/ForkPoolWorker-2634] Running <TaskInstance: my_dag.one_of_dag_task scheduled__2022-07-03T12:00:00+00:00 [running]> on host data-platform-airflow-worker-cfd5b7655-68n2 
   ```
   
   5. And finally, in the first worker's logs, we see:
   ```log
   data-eks-prod worker [2022-07-04 23:21:09,542: INFO/ForkPoolWorker-828] Task airflow.executors.celery_executor.execute_command[661f0a53-129b-4856-8db0-5632ff466833] succeeded in 30184.482223104016s: None
   ```
   
   (_This is a very long task, basically waiting for a Spark job to terminate._)
   
   Here it ended well, but sometimes there is just nothing happening at all. How is it possible that no logs are generated?
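   
   One thing I still want to rule out (I am not sure it is related) is the broker `visibility_timeout`, because the second delivery in step 4 happened roughly six hours after the first one and we are on a Redis broker. An untested sketch like this should show the Celery settings actually in effect on the workers:
   
   ```python
   # Rough sketch (untested): print the Celery settings the CeleryExecutor is running with.
   # With a Redis broker and late acknowledgement, a task that is not acknowledged before
   # the visibility timeout expires can be re-delivered to another worker.
   from airflow.executors.celery_executor import app

   print(app.conf.broker_transport_options)  # e.g. {'visibility_timeout': 21600}
   print(app.conf.task_acks_late)
   ```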
   
   I am still trying to wrap my head around Airflow's internals to understand this issue, but I wanted to contribute what I have been able to find so far. Some more information about our architecture:
   
   **Environment:**
   
   Cloud provider or hardware configuration: AWS
   OS (e.g. from /etc/os-release): centos rhel fedora
   Others: Redis, CeleryExecutor, Airflow 2.2.5
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org