Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/05/24 18:35:23 UTC

[GitHub] [airflow] gaom25 opened a new issue #16029: Worker exited prematurely SIGTERM

gaom25 opened a new issue #16029:
URL: https://github.com/apache/airflow/issues/16029


   **Apache Airflow version**: 2.0.1
   
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`):
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: AWS EC2 t3.xlarge
   - **OS** (e.g. from /etc/os-release): Debian
   - **Kernel** (e.g. `uname -a`): 4.19.0-16-cloud-amd64 #1 SMP Debian 4.19.181-1 (2021-03-19) x86_64 GNU/Linux
   - **Install tools**: docker standalone
   - **Others**:
   
   **What happened**:
   
   When running a DAG with 25 dynamically generated tasks that are basically `BashOperator`s executing `sleep 10`, some of the tasks get killed externally. The worker logs look like this:
   ```
   May 24 17:26:51 ip-172-21-74-208 docker[32419]: [2021-05-24 17:26:51,709: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM) Job: 18.')
   May 24 17:26:51 ip-172-21-74-208 docker[32419]: Traceback (most recent call last):
   May 24 17:26:51 ip-172-21-74-208 docker[32419]:   File "/usr/local/lib/python3.7/dist-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
   May 24 17:26:51 ip-172-21-74-208 docker[32419]:     human_status(exitcode), job._job),
   May 24 17:26:51 ip-172-21-74-208 docker[32419]: billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM) Job: 18.
   ```
   
   And the scheduler logs show this:
   ```
   May 24 17:26:52 ip-172-21-74-201 docker[21979]: {"asctime": "2021-05-24 17:26:52,697", "name": "airflow.jobs.scheduler_job.SchedulerJob", "filename": "scheduler_job.py", "lineno": 1256, "levelname": "ERROR", "message": "Executor reports task instance <TaskInstance: vm_in_machine_dag.vm_in_machine_task-18 2021-05-24 17:24:20+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?"}
   May 24 17:26:52 ip-172-21-74-201 docker[21979]: {"asctime": "2021-05-24 17:26:52,699", "name": "airflow.processor_manager", "filename": "dag_processing.py", "lineno": 621, "levelname": "DEBUG", "message": "Received {'full_filepath': '/opt/nonroot/airflow/dags/airflow-dag-templates--vm_in_machine/airflow-sample-dags/vm_in_machine/busy_task.py', 'msg': 'Executor reports task instance <TaskInstance: vm_in_machine_dag.vm_in_machine_task-18 2021-05-24 17:24:20+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?', 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7fea607a44a8>, 'is_failure_callback': True} signal from DagFileProcessorAgent"}
   ```
   
   **What you expected to happen**:
   
   That the DAG runs correctly without any task failing, because machine resource usage stays below 50% for both memory and CPU.
   
   **How to reproduce it**:
   
   This is the DAG code:
   ```
   from airflow.operators.bash_operator import BashOperator
   from airflow.models import DAG
   import logging
   from datetime import timedelta
   
   logger = logging.getLogger('airflow.task')
   default_args = {
       'depends_on_past': False,
       'start_date': '2021-01-21',
       'retries': 0,
       'retry_delay': timedelta(minutes=1),
       'email_on_failure': False,
       'email_on_retry': False,
       'owner': 'airflow'
   }
   dag = DAG(
       dag_id="vm_in_machine_dag",
       default_args=default_args,
       schedule_interval='@once',
       catchup=False)
   
   
   def dynamic_tasks():
       tasks_count = 25
       for i in range(tasks_count):
           BashOperator(
               task_id="vm_in_machine_task-"+str(i),
               depends_on_past=False,
               bash_command="sleep 10s",
               dag=dag)
   
   
   dynamic_tasks()
   
   ```
   To reproduce it, launch 3 t3.xlarge EC2 instances that can all reach each other, and launch 7 Airflow instances where each EC2 machine hosts one Airflow component: instance 1 runs all the webservers, instance 2 all the schedulers, and instance 3 all the workers, roughly as sketched below.
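
   A rough sketch of how one of the seven Airflow deployments (prefix "APPUSEA"; the others follow the same pattern with a different prefix) is spread over the three machines. The image name is a placeholder and the CeleryExecutor / broker / metadata-DB settings come from our own environment files, so this only illustrates the layout:
   ```
   # EC2 instance 1: the webserver container of deployment A
   docker run -d --name APPUSEA-af-ws  <our-airflow-2.0.1-image> airflow webserver

   # EC2 instance 2: the scheduler container
   docker run -d --name APPUSEA-af-sch <our-airflow-2.0.1-image> airflow scheduler

   # EC2 instance 3: the Celery worker container
   docker run -d --name APPUSEA-af-wkr <our-airflow-2.0.1-image> airflow celery worker
   ```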
   
   Then launch all the DAGs at the same time; we use this bash script to trigger them all simultaneously:
   ```
   #!/bin/bash

   dags_ids=("vm_in_machine_dag")

   # Loop over every container whose name starts with APPUSE and trigger
   # each DAG in it, in the background.
   for i in $(docker ps -f name=APPUSE --format "{{.Names}}" | awk '{print $1}')
   do
     for j in "${dags_ids[@]}"; do
       docker exec -e AA=$j $i /bin/bash -c 'airflow dags unpause $AA; airflow dags trigger $AA' &
     done
   done
   echo "Done"
   
   ```
   We name each Airflow instance's containers with the same prefix; for example, for Airflow instance 1 the components will be:
   * APPUSEA-af-ws: webserver
   * APPUSEA-af-sch: scheduler
   * APPUSEA-af-wkr: worker
   
   That is why the bash script filters the running containers by names that start with APPUSE, as in the example output below.
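
   For illustration (the names are hypothetical but follow the convention above), on the worker machine the filter returns one container per deployment, something like:
   ```
   $ docker ps -f name=APPUSE --format "{{.Names}}"
   APPUSEA-af-wkr
   APPUSEB-af-wkr
   APPUSEC-af-wkr
   ```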
   
   After a while, around 5 minutes, some tasks end up in the failed state; they may or may not have produced a log, and sometimes the log even says the task was marked as SUCCESS.
   
   **Anything else we need to know**:
   
   There is a thread in the troubleshooting Slack channel; this is the link: https://apache-airflow.slack.com/archives/CCQ7EGB1P/p1620925239197200
   
   Lastly, we were experimenting with the configuration, changing the values of:
   * scheduler_health_check_threshold
   * scheduler_heartbeat_sec
   * job_heartbeat_sec
   * web_server_worker_timeout
   * web_server_master_timeout
   increasing the values on each run; right now we are using values around 420 seconds for the health check threshold, 300 seconds for the heartbeat, and so on.
   
   The rest of the values are left at their defaults.
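
   These overrides can be set either in airflow.cfg or as environment variables; as a sketch with environment variables (only the 420 and 300 figures above are our exact values, the remaining numbers are illustrative):
   ```
   # Airflow maps AIRFLOW__<SECTION>__<KEY> environment variables onto its config
   export AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD=420
   export AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC=300
   export AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC=300           # illustrative
   export AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT=300   # illustrative
   export AIRFLOW__WEBSERVER__WEB_SERVER_MASTER_TIMEOUT=300   # illustrative
   ```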
   
   With this we are trying to find out how many instances of each Airflow component a t3.xlarge EC2 instance can handle before failing, so we created a basic DAG that stresses the parallelism of the machine.





[GitHub] [airflow] boring-cyborg[bot] commented on issue #16029: Worker exited prematurely SIGTERM

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #16029:
URL: https://github.com/apache/airflow/issues/16029#issuecomment-847247893


   Thanks for opening your first issue here! Be sure to follow the issue template!
   

