Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/08/31 12:35:04 UTC

[GitHub] [airflow] dan-origami edited a comment on issue #16703: Workers silently crash after memory build up

dan-origami edited a comment on issue #16703:
URL: https://github.com/apache/airflow/issues/16703#issuecomment-909191767


    @kaxil @ephraimbuddy Just to give you a bit of an update, I think I have found the actual cause of this.
   
   I noticed that we seem to hit a limit on the number of active tasks on a Celery worker (all our settings here are currently default), so a maximum of 16 per worker and 32 across the whole Airflow setup.
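   
   For reference, a quick way to confirm the effective limits (a minimal sketch, assuming a stock Airflow 2.x install where `airflow.configuration.conf` is importable; the section/key names are the defaults):
   
   ```python
   # Minimal sketch: print the concurrency limits that apply here.
   # Assumes a stock Airflow 2.x install with default section/key names.
   from airflow.configuration import conf

   # Tasks a single Celery worker will run at once (default 16).
   print("worker_concurrency:", conf.getint("celery", "worker_concurrency"))

   # Tasks the whole Airflow installation will run at once (default 32).
   print("parallelism:", conf.getint("core", "parallelism"))
   ```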
   
   However, I noticed that when this problem manifests nothing gets scheduled, so I started looking into our workers via Flower.
   
   Screenshots are below, but basically we have some fairly big DAGs that run Spark jobs on a Spark cluster in the same Kubernetes cluster (PySpark, so the driver lives inside the Airflow worker; we can go into more detail on the Spark setup if you want, but it's BashOperator rather than SparkOperator for a number of reasons).
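   
   For context, the tasks look roughly like the sketch below (illustrative only: the DAG id, master URL and spark-submit arguments are placeholders, not our real ones; the task id is taken from the log further down). The point is that with `--deploy-mode client` the PySpark driver process runs inside the Airflow worker pod:
   
   ```python
   # Illustrative sketch only: a PySpark job submitted via BashOperator in
   # client deploy mode, so the driver runs inside the Airflow worker.
   # DAG id, master URL and script path are placeholders.
   from datetime import datetime
   from airflow import DAG
   from airflow.operators.bash import BashOperator

   with DAG(
       dag_id="example_spark_dag",
       start_date=datetime(2021, 1, 1),
       schedule_interval="@hourly",
       catchup=False,
   ) as dag:
       timeseries_api = BashOperator(
           task_id="timeseries_api",
           # client deploy mode => the driver process lives in the worker container
           bash_command="spark-submit --master spark://spark-master:7077 "
                        "--deploy-mode client jobs/timeseries_api.py",
       )
   ```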
   
   Sometimes the tasks in these DAGs fail, which Airflow picks up: the task is marked as failed. However, these tasks still sit on the Celery worker as active tasks and are never removed.
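   
   To see the disconnect, something like the rough sketch below lists what Celery still considers active, so it can be cross-checked against the task states in the Airflow UI (it assumes the Celery app exposed by the CeleryExecutor is importable; not exactly what we run, just an illustration):
   
   ```python
   # Rough sketch: list Celery tasks still "active" on the workers so they can
   # be cross-checked against task states in the Airflow UI / metadata DB.
   # Assumes the Celery app used by Airflow's CeleryExecutor is importable.
   from airflow.executors.celery_executor import app

   inspector = app.control.inspect()
   active = inspector.active() or {}  # {worker_name: [task dicts]}

   for worker, tasks in active.items():
       for task in tasks:
           # 'args' contains the "airflow tasks run ..." command seen in the logs
           print(worker, task["id"], task.get("args"))
   ```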
   
   We can manually delete them and things recover, so the Celery worker itself is still active and has not crashed. The workers just do not seem to log anything while they are not picking up or running any new tasks. The active PIDs listed in Flower also seem to match up.
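   
   The manual cleanup is essentially a revoke of the stuck task by its Celery id (sketch below; the id is a placeholder, and we normally do this through Flower's UI rather than from code):
   
   ```python
   # Sketch of the manual cleanup: revoke a stuck task by its Celery task id.
   # The id below is a placeholder taken from Flower.
   from airflow.executors.celery_executor import app

   app.control.revoke("celery-task-id-from-flower", terminate=True)
   ```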
   
   It's not clear why the task failed, but we have the logs of it being picked up by the worker (I've removed a few bits).
   
   It also explains why I went down the memory/resource-issue rabbit hole, as these stuck tasks sit around on the worker(s).
   
   There are some parameters we can tune, I think, including timeouts on the tasks and settings on the Celery side. Do you know of any known issues with this disconnect, where a task is marked failed in Airflow but is not removed from the Celery worker?
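   
   On the Airflow side the obvious knob is a per-task `execution_timeout=timedelta(...)`; on the Celery side, something like the hard/soft time limits below, applied via a custom config module that `[celery] celery_config_options` in airflow.cfg points at. This is a sketch with arbitrary values; we have not yet validated that it actually frees the stuck slots:
   
   ```python
   # Sketch of a Celery-side time limit, applied by pointing
   # [celery] celery_config_options in airflow.cfg at this dict's import path.
   # Values are arbitrary examples.
   from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG

   CELERY_CONFIG = {
       **DEFAULT_CELERY_CONFIG,
       # Hard-kill a task (the worker child process) after 2 hours.
       "task_time_limit": 2 * 60 * 60,
       # Raise SoftTimeLimitExceeded in the task shortly before the hard kill.
       "task_soft_time_limit": 2 * 60 * 60 - 300,
   }
   ```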
   
   The worker was not rebooted/crashed at any point during this time.
   
   ```
   [2021-08-28 06:54:43,865: INFO/ForkPoolWorker-8] Executing command in Celery: ['airflow', 'tasks', 'run', 'dagname', 'timeseries_api', '2021-08-28T06:18:00+00:00', '--local', '--pool', 'default_pool', '--subdir', 'dagfolder']
   [2021-08-28 06:54:44,122: WARNING/ForkPoolWorker-8] Running <TaskInstance:dagname.timeseries_api 2021-08-28T06:18:00+00:00 [queued]> on host airflow-worker-2.airflow-worker.airflow.svc.cluster.local
   ```
   
   <img width="579" alt="Screenshot 2021-08-31 at 13 16 40" src="https://user-images.githubusercontent.com/51330721/131501535-c74321ef-04a1-46ba-b260-032ac52089f0.png">
   
   <img width="1666" alt="Screenshot 2021-08-31 at 13 17 18" src="https://user-images.githubusercontent.com/51330721/131503343-f463ccea-f8f4-45cb-a50d-d9f02b3ba2dc.png">
   
   <img width="1101" alt="Screenshot 2021-08-31 at 13 18 21" src="https://user-images.githubusercontent.com/51330721/131503356-2aafb596-e90e-4a66-b0a4-aaf73073c91d.png">
   
   <img width="1666" alt="Screenshot 2021-08-31 at 13 18 03" src="https://user-images.githubusercontent.com/51330721/131503371-ca565fe7-9bb2-43d2-89fc-2929f89c71d4.png">
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org