Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2022/10/19 11:33:36 UTC

[GitHub] [airflow] stromal opened a new issue, #27131: Scheduled runtime and actual runtime totally different

stromal opened a new issue, #27131:
URL: https://github.com/apache/airflow/issues/27131

   ### Apache Airflow version
   
   2.4.1
   
   ### What happened
   
   # Situation
   
   - I have a DAG that is scheduled to run at 1:30 am UTC.
   - But the logs show it started running at 2022-10-19, 08:09:12 UTC.
   - It ran for about the same duration as before and delivered the data correctly from point A to B.
   - I haven't modified this DAG file for days.
   - Nothing else is scheduled for this time period at all.
   - If I look at the previous logs, sometimes it starts running at 1:30 am UTC and sometimes at totally different, seemingly random times (see the diagnostic sketch after this list).
      - (These are Saturday and Sunday runs,
      - I am the only one with access to Airflow and the files,
      - no one else modified them,
      - and I definitely did not touch or trigger them on Saturday and Sunday.)
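
   One way to see how late each run actually started, and whether any runs were manual rather than scheduled, is to query the metadata database from inside the Airflow container. A minimal diagnostic sketch, assuming a working Airflow 2.4 environment with the metadata DB configured (only the `my_dag` id is taken from the DAG below):

   ```
   # Compare each run's logical date with its actual start time; run_type
   # distinguishes scheduled runs from manually triggered ones.
   from airflow.models import DagRun
   from airflow.utils.session import create_session

   with create_session() as session:
       runs = (
           session.query(DagRun)
           .filter(DagRun.dag_id == "my_dag")
           .order_by(DagRun.execution_date.desc())
           .limit(10)
       )
       for run in runs:
           print(run.run_id, run.run_type, run.execution_date, run.start_date)
   ```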
   
   # System
   
   - ubuntu 20.04 LTS
   - Docker Airflow
   - AWS EC2
   - x86 AMD / x86 Intel CPU
   
   # Dockerfile
   
   ```
   FROM apache/airflow:2.4.1-python3.8
   
   # ExternalPythonOperator
   # PIP_USER must be switched off to install into your own venv for the ExternalPythonOperator
   ENV PIP_USER=false
   
   # set up the Python venv that the ExternalPythonOperator will use
   RUN python3 -m venv /opt/airflow/my_virtual_env
   
   # Install dependencies:
   COPY airflow/requirements.txt .
   # copy over the folder that contains my own local pip package
   COPY airflow/my_own_pip_package_folder /opt/airflow/my_own_pip_package_folder
   # copy over whatever other files are needed
   COPY my_file.py /opt/airflow/my_file.py
   
   # install packages into my own virtual environment inside the image
   RUN /opt/airflow/my_virtual_env/bin/pip install -r requirements.txt
   RUN /opt/airflow/my_virtual_env/bin/pip install /opt/airflow/my_own_pip_package_folder
   RUN /opt/airflow/my_virtual_env/bin/pip install 'apache-airflow==2.4.1' --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.4.1/constraints-3.8.txt"
   
   ```
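
   After building the image, a quick sanity check is to call the venv's interpreter directly. A minimal sketch, assuming the paths from the Dockerfile above (run it inside the built container):

   ```
   # The venv interpreter should exist and report the Airflow version pinned
   # by the Dockerfile above.
   import subprocess

   result = subprocess.run(
       ["/opt/airflow/my_virtual_env/bin/python3", "-c",
        "import airflow; print(airflow.__version__)"],
       capture_output=True, text=True, check=True,
   )
   print(result.stdout.strip())  # expected: 2.4.1
   ```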
   
   
   
   
   
   
   # My DAG File
   
   ```
   """
   Example DAG demonstrating the usage of the TaskFlow API to execute Python functions natively and within a
   virtual environment.
   """
   from __future__ import annotations
   
   import logging
   import os
   import shutil
   import sys
   import tempfile
   import time
   from pprint import pprint
   
   import pendulum
   
   from airflow import DAG
   from airflow.decorators import task
   
   log = logging.getLogger(__name__)
   PYTHON = sys.executable
   BASE_DIR = tempfile.gettempdir()
   
   
   with DAG(
       dag_id='my_dag',
    schedule='30 1 * * *',  # cron is evaluated in UTC, e.g. 11:12 UTC = 12:12 BST (UK summer time)
       start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
       catchup=False,
       tags=['Redshift'],
       ) as dag:
   
       @task.external_python(task_id="my_dag_task", python='/opt/airflow/my_virtual_env/bin/python3')
       def go():
           """
           Example function that will be performed in a virtual environment.
           Importing at the module level ensures that it will not attempt to import the
           library before it is installed.
           """
           # my imports
           # my code
   
       external_python_task = go()
   ```
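
   Note that with this schedule, the run stamped `scheduled__2022-10-18T01:30:00+00:00` covers the data interval from 2022-10-18T01:30 to 2022-10-19T01:30 and is only queued once that interval ends. A minimal sketch of that interval arithmetic, using croniter (which ships as an Airflow dependency):

   ```
   from datetime import datetime, timezone
   from croniter import croniter

   # The timestamp in a scheduled run_id is the *start* of the data interval.
   logical_date = datetime(2022, 10, 18, 1, 30, tzinfo=timezone.utc)
   # The run is queued only after the interval ends, at the next cron fire.
   interval_end = croniter("30 1 * * *", logical_date).get_next(datetime)
   print(interval_end)  # 2022-10-19 01:30:00+00:00 -- earliest possible start
   ```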
   
   
   # Logs
   
   ## Log of a run at the wrong time, which nevertheless ran without failure (Wednesday early morning)
   
   ```
   4d37309d53a3
   *** Reading local file: /opt/airflow/logs/dag_id=my_dag/run_id=scheduled__2022-10-18T01:30:00+00:00/task_id=my_dag_task/attempt=1.log
   [2022-10-19, 08:09:12 UTC] {taskinstance.py:1165} INFO - Dependencies all met for <TaskInstance: my_dag.my_dag_task scheduled__2022-10-18T01:30:00+00:00 [queued]>
   [2022-10-19, 08:09:12 UTC] {taskinstance.py:1165} INFO - Dependencies all met for <TaskInstance: my_dag.my_dag_task scheduled__2022-10-18T01:30:00+00:00 [queued]>
   [2022-10-19, 08:09:12 UTC] {taskinstance.py:1362} INFO - 
   --------------------------------------------------------------------------------
   [2022-10-19, 08:09:12 UTC] {taskinstance.py:1363} INFO - Starting attempt 1 of 1
   [2022-10-19, 08:09:12 UTC] {taskinstance.py:1364} INFO - 
   --------------------------------------------------------------------------------
   [2022-10-19, 08:09:12 UTC] {taskinstance.py:1383} INFO - Executing <Task(_PythonExternalDecoratedOperator): my_dag_task> on 2022-10-18 01:30:00+00:00
   [2022-10-19, 08:09:12 UTC] {standard_task_runner.py:54} INFO - Started process 58 to run task
   [2022-10-19, 08:09:12 UTC] {standard_task_runner.py:82} INFO - Running: ['airflow', 'tasks', 'run', 'my_dag', 'my_dag_task', 'scheduled__2022-10-18T01:30:00+00:00', '--job-id', '235', '--raw', '--subdir', 'DAGS_FOLDER/my_dag.py', '--cfg-path', '/tmp/tmpps6y22oi']
   [2022-10-19, 08:09:12 UTC] {standard_task_runner.py:83} INFO - Job 235: Subtask my_dag_task
   [2022-10-19, 08:09:12 UTC] {dagbag.py:525} INFO - Filling up the DagBag from /opt/airflow/dags/my_dag.py
   [2022-10-19, 08:09:12 UTC] {task_command.py:384} INFO - Running <TaskInstance: my_dag.my_dag_task scheduled__2022-10-18T01:30:00+00:00 [running]> on host 4d37309d53a3
   [2022-10-19, 08:09:13 UTC] {taskinstance.py:1590} INFO - Exporting the following env vars:
   AIRFLOW_CTX_DAG_OWNER=airflow
   AIRFLOW_CTX_DAG_ID=my_dag
   AIRFLOW_CTX_TASK_ID=my_dag_task
   AIRFLOW_CTX_EXECUTION_DATE=2022-10-18T01:30:00+00:00
   AIRFLOW_CTX_TRY_NUMBER=1
   AIRFLOW_CTX_DAG_RUN_ID=scheduled__2022-10-18T01:30:00+00:00
   [2022-10-19, 08:09:16 UTC] {process_utils.py:179} INFO - Executing cmd: /opt/airflow/my_virtual_env/bin/python3 /tmp/tmdkd47ow0r/script.py /tmp/tmdkd47ow0r/script.in /tmp/tmdkd47ow0r/script.out /tmp/tmdkd47ow0r/string_args.txt
   [2022-10-19, 08:09:16 UTC] {process_utils.py:183} INFO - Output:
   [2022-10-19, 08:09:18 UTC] {process_utils.py:187} INFO - WARNING:root:/opt/airflow/logs/scheduler/latest already exists as a dir/file. Skip creating symlink.
   [2022-10-19, 08:09:22 UTC] {process_utils.py:187} INFO - startdate: 2022-10-18T00:00:00.000 enddate: 2022-10-18T23:59:59.000
   [2022-10-19, 08:09:22 UTC] {process_utils.py:187} INFO - [2022-10-19 08:09:22,282] {credentials.py:1108} INFO - Found credentials from IAM Role: BI-Read-Creds-2
   [2022-10-19, 08:09:30 UTC] {python.py:177} INFO - Done. Returned value was: [True, 'schema.table_sql']
   [2022-10-19, 08:09:30 UTC] {taskinstance.py:1401} INFO - Marking task as SUCCESS. dag_id=my_dag, task_id=my_dag_task, execution_date=20221018T013000, start_date=20221019T080912, end_date=20221019T080930
   [2022-10-19, 08:09:31 UTC] {local_task_job.py:164} INFO - Task exited with return code 0
   [2022-10-19, 08:09:31 UTC] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check
   
   ```
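
   Using the interval arithmetic sketched after the DAG file, this run became eligible at 2022-10-19T01:30 UTC but actually started at 08:09:12. A small worked computation of the gap (timestamps taken from the log above):

   ```
   from datetime import datetime, timezone

   eligible = datetime(2022, 10, 19, 1, 30, tzinfo=timezone.utc)     # interval end
   started = datetime(2022, 10, 19, 8, 9, 12, tzinfo=timezone.utc)   # from the log
   print(started - eligible)  # 6:39:12 of unexplained queuing delay
   ```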
   
   
   ## Log of a run at the correct time, which ran without failure (Tuesday evening)
   
   ```
   9adeff0b1844
   *** Reading local file: /opt/airflow/logs/dag_id=my_dag/run_id=scheduled__2022-10-17T01:30:00+00:00/task_id=my_dag_task/attempt=1.log
   [2022-10-18, 01:30:00 UTC] {taskinstance.py:1165} INFO - Dependencies all met for <TaskInstance: my_dag.my_dag_task scheduled__2022-10-17T01:30:00+00:00 [queued]>
   [2022-10-18, 01:30:00 UTC] {taskinstance.py:1165} INFO - Dependencies all met for <TaskInstance: my_dag.my_dag_task scheduled__2022-10-17T01:30:00+00:00 [queued]>
   [2022-10-18, 01:30:00 UTC] {taskinstance.py:1362} INFO - 
   --------------------------------------------------------------------------------
   [2022-10-18, 01:30:00 UTC] {taskinstance.py:1363} INFO - Starting attempt 1 of 1
   [2022-10-18, 01:30:00 UTC] {taskinstance.py:1364} INFO - 
   --------------------------------------------------------------------------------
   [2022-10-18, 01:30:00 UTC] {taskinstance.py:1383} INFO - Executing <Task(_PythonExternalDecoratedOperator): my_dag_task> on 2022-10-17 01:30:00+00:00
   [2022-10-18, 01:30:00 UTC] {standard_task_runner.py:54} INFO - Started process 23538 to run task
   [2022-10-18, 01:30:00 UTC] {standard_task_runner.py:82} INFO - Running: ['airflow', 'tasks', 'run', 'my_dag', 'my_dag_task', 'scheduled__2022-10-17T01:30:00+00:00', '--job-id', '212', '--raw', '--subdir', 'DAGS_FOLDER/my_dag.py', '--cfg-path', '/tmp/tmp4aa6swro']
   [2022-10-18, 01:30:00 UTC] {standard_task_runner.py:83} INFO - Job 212: Subtask my_dag_task
   [2022-10-18, 01:30:00 UTC] {dagbag.py:525} INFO - Filling up the DagBag from /opt/airflow/dags/my_dag.py
   [2022-10-18, 01:30:01 UTC] {task_command.py:384} INFO - Running <TaskInstance: my_dag.my_dag_task scheduled__2022-10-17T01:30:00+00:00 [running]> on host 9adeff0b1844
   [2022-10-18, 01:30:01 UTC] {taskinstance.py:1590} INFO - Exporting the following env vars:
   AIRFLOW_CTX_DAG_OWNER=airflow
   AIRFLOW_CTX_DAG_ID=my_dag
   AIRFLOW_CTX_TASK_ID=my_dag_task
   AIRFLOW_CTX_EXECUTION_DATE=2022-10-17T01:30:00+00:00
   AIRFLOW_CTX_TRY_NUMBER=1
   AIRFLOW_CTX_DAG_RUN_ID=scheduled__2022-10-17T01:30:00+00:00
   [2022-10-18, 01:30:02 UTC] {process_utils.py:179} INFO - Executing cmd: /opt/airflow/my_virtual_env/bin/python3 /tmp/tmdebmcqmsm/script.py /tmp/tmdebmcqmsm/script.in /tmp/tmdebmcqmsm/script.out /tmp/tmdebmcqmsm/string_args.txt
   [2022-10-18, 01:30:02 UTC] {process_utils.py:183} INFO - Output:
   [2022-10-18, 01:30:03 UTC] {process_utils.py:187} INFO - WARNING:root:/opt/airflow/logs/scheduler/latest already exists as a dir/file. Skip creating symlink.
   [2022-10-18, 01:30:06 UTC] {process_utils.py:187} INFO - startdate: 2022-10-17T00:00:00.000 enddate: 2022-10-17T23:59:59.000
   [2022-10-18, 01:30:06 UTC] {process_utils.py:187} INFO - [2022-10-18 01:30:06,223] {credentials.py:1108} INFO - Found credentials from ......[0m
   [2022-10-18, 01:30:17 UTC] {python.py:177} INFO - Done. Returned value was: [True, 'schema.table_sql']
   [2022-10-18, 01:30:17 UTC] {taskinstance.py:1401} INFO - Marking task as SUCCESS. dag_id=my_dag, task_id=my_dag_task, execution_date=20221017T013000, start_date=20221018T013000, end_date=20221018T013017
   [2022-10-18, 01:30:17 UTC] {local_task_job.py:164} INFO - Task exited with return code 0
   [2022-10-18, 01:30:17 UTC] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check
   ```
   
   
   
   ### What you think should happen instead
   
   The DAG run should start at its scheduled time, even if the run would then fail.
   
   ### How to reproduce
   
   _No response_
   
   ### Operating System
   
   Ubuntu 20.04 LTS
   
   ### Versions of Apache Airflow Providers
   
   IMAGE extended: `apache/airflow:2.4.1-python3.8`
   
   
   ### Deployment
   
   Other Docker-based deployment
   
   ### Deployment details
   
   _No response_
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   



[GitHub] [airflow] potiuk closed issue #27131: Scheduled runtime and actual runtime totally different

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #27131: Scheduled runtime and actual runtime totally different
URL: https://github.com/apache/airflow/issues/27131




[GitHub] [airflow] potiuk commented on issue #27131: Scheduled runtime and actual runtime totally different

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #27131:
URL: https://github.com/apache/airflow/issues/27131#issuecomment-1291208983

   I think this is a misunderstanding. I do not fully understand what you tried to do and what went wrong, because you mention both "scheduled" runs and "triggering", and those two are completely different. "Triggered" tasks run when you trigger them; scheduled tasks run when their schedule comes. Regarding the first two runs, I think something may simply have caused the task to run twice (this is what **could** have happened).
   
   All the other runs I assume you triggered manually, so they - correctly - started as soon as you triggered them, without looking at the schedule (this is how triggered tasks work).
   
   Unless I completely misunderstood what you wanted to explain, in which case I suggest you try to explain it more precisely.
   
   Converting it to a discussion until more information is provided.
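
   For illustration: scheduled runs get run_ids like `scheduled__<interval start>` and start when their data interval ends, while manual runs get `manual__<trigger time>` and start immediately, so the run_id prefix in the logs tells the two apart (the value below is taken from the logs above):

   ```
   run_id = "scheduled__2022-10-18T01:30:00+00:00"  # from the log above
   origin = "scheduled" if run_id.startswith("scheduled__") else "manually triggered"
   print(origin)  # scheduled
   ```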


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org