Posted to commits@airflow.apache.org by "nclaeys (via GitHub)" <gi...@apache.org> on 2023/03/08 11:02:41 UTC

[GitHub] [airflow] nclaeys opened a new issue, #29974: Inconsistent behavior of EmptyOperator between start and end tasks

nclaeys opened a new issue, #29974:
URL: https://github.com/apache/airflow/issues/29974

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### What happened
   
   We are using Airflow 2.4.3.
   
   The documentation for the EmptyOperator states explicitly that it is never processed by the executor.
   However, what I notice is that in our case the behavior differs between start and end EmptyOperators: the start tasks are not processed by the executor, but the end tasks are.
   
   This is unexpected and inefficient, as in our case it creates a pod on Kubernetes for no reason. Additionally, it causes some weird behavior in our lineage graphs.
   
   For the start task we see no logs:
   ```
   *** Log file does not exist: /opt/airflow/logs/dag_id=dbt-datahub/run_id=scheduled__2023-03-07T00:00:00+00:00/task_id=initial_task_start/attempt=1.log
   *** Fetching from: http://:8793/log/dag_id=dbt-datahub/run_id=scheduled__2023-03-07T00:00:00+00:00/task_id=initial_task_start/attempt=1.log
   *** Failed to fetch log file from worker. Request URL is missing an 'http://' or 'https://' protocol.
   ```
   
   For the end task, however, the executor does run it and we get full task logs:
   
   ```
   dbtdatahubend-dc6d51700abc41e0974b46caafd857ac
   *** Reading local file: /opt/airflow/logs/dag_id=dbt-datahub/run_id=manual__2023-03-07T16:56:07.937548+00:00/task_id=end/attempt=1.log
   [2023-03-07, 16:56:31 UTC] {taskinstance.py:1165} INFO - Dependencies all met for <TaskInstance: dbt-datahub.end manual__2023-03-07T16:56:07.937548+00:00 [queued]>
   [2023-03-07, 16:56:31 UTC] {taskinstance.py:1165} INFO - Dependencies all met for <TaskInstance: dbt-datahub.end manual__2023-03-07T16:56:07.937548+00:00 [queued]>
   [2023-03-07, 16:56:31 UTC] {taskinstance.py:1362} INFO - 
   --------------------------------------------------------------------------------
   [2023-03-07, 16:56:31 UTC] {taskinstance.py:1363} INFO - Starting attempt 1 of 1
   [2023-03-07, 16:56:31 UTC] {taskinstance.py:1364} INFO - 
   --------------------------------------------------------------------------------
   [2023-03-07, 16:56:31 UTC] {taskinstance.py:1383} INFO - Executing <Task(EmptyOperator): end> on 2023-03-07 16:56:07.937548+00:00
   [2023-03-07, 16:56:31 UTC] {standard_task_runner.py:55} INFO - Started process 19 to run task
   [2023-03-07, 16:56:31 UTC] {standard_task_runner.py:82} INFO - Running: ['airflow', 'tasks', 'run', 'dbt-datahub', 'end', 'manual__2023-03-07T16:56:07.937548+00:00', '--job-id', '24', '--raw', '--subdir', 'DAGS_FOLDER/dbt-datahub/dbt-datahub.py', '--cfg-path', '/tmp/tmpdr42kl3k']
   [2023-03-07, 16:56:31 UTC] {standard_task_runner.py:83} INFO - Job 24: Subtask end
   [2023-03-07, 16:56:31 UTC] {task_command.py:376} INFO - Running <TaskInstance: dbt-datahub.end manual__2023-03-07T16:56:07.937548+00:00 [running]> on host dbtdatahubend-dc6d51700abc41e0974b46caafd857ac
   [2023-03-07, 16:56:31 UTC] {taskinstance.py:1590} INFO - Exporting the following env vars:
   AIRFLOW_CTX_DAG_OWNER=Conveyor
   AIRFLOW_CTX_DAG_ID=dbt-datahub
   AIRFLOW_CTX_TASK_ID=end
   AIRFLOW_CTX_EXECUTION_DATE=2023-03-07T16:56:07.937548+00:00
   AIRFLOW_CTX_TRY_NUMBER=1
   AIRFLOW_CTX_DAG_RUN_ID=manual__2023-03-07T16:56:07.937548+00:00
   [2023-03-07, 16:56:31 UTC] {taskinstance.py:1401} INFO - Marking task as SUCCESS. dag_id=dbt-datahub, task_id=end, execution_date=20230307T165607, start_date=20230307T165631, end_date=20230307T165631
   [2023-03-07, 16:56:31 UTC] {base.py:71} INFO - Using connection ID 'datahub_rest_default' for task execution.
   [2023-03-07, 16:56:31 UTC] {base.py:71} INFO - Using connection ID 'datahub_rest_default' for task execution.
   [2023-03-07, 16:56:31 UTC] {_plugin.py:147} INFO - Emitting Datahub Dataflow: DataFlow(urn=<datahub.utilities.urns.data_flow_urn.DataFlowUrn object at 0x7fb9ced397c0>, id='dbt-datahub', orchestrator='airflow', cluster='prod', name=None, description='None\n\n', properties={'_access_control': 'None', '_default_view': "'grid'", 'catchup': 'True', 'fileloc': "'/opt/airflow/dags/dbt-datahub/dbt-datahub.py'", 'is_paused_upon_creation': 'None', 'start_date': 'None', 'tags': '[]', 'timezone': "Timezone('UTC')"}, url='https://app.dev.datafy.cloud/environments/datahubtest/airflow/tree?dag_id=dbt-datahub', tags=set(), owners={'Conveyor'})
   [2023-03-07, 16:56:31 UTC] {_plugin.py:165} INFO - Emitting Datahub Datajob: DataJob(id='end', urn=<datahub.utilities.urns.data_job_urn.DataJobUrn object at 0x7fb9cecbbfa0>, flow_urn=<datahub.utilities.urns.data_flow_urn.DataFlowUrn object at 0x7fb9cecbf910>, name=None, description=None, properties={'depends_on_past': 'False', 'email': '[]', 'label': "'end'", 'execution_timeout': 'None', 'sla': 'None', 'task_id': "'end'", 'trigger_rule': "<TriggerRule.ALL_SUCCESS: 'all_success'>", 'wait_for_downstream': 'False', 'downstream_task_ids': 'set()', 'inlets': '[]', 'outlets': '[]'}, url='https://app.dev.datafy.cloud/environments/datahubtest/airflow/taskinstance/list/?flt1_dag_id_equals=dbt-datahub&_flt_3_task_id=end', tags=set(), owners={'Conveyor'}, group_owners=set(), inlets=[], outlets=[], upstream_urns=[<datahub.utilities.urns.data_job_urn.DataJobUrn object at 0x7fb9cecbbc10>])
   [2023-03-07, 16:56:31 UTC] {_plugin.py:179} INFO - Emitted Start Datahub Dataprocess Instance: DataProcessInstance(id='dbt-datahub_end_manual__2023-03-07T16:56:07.937548+00:00', urn=<datahub.utilities.urns.data_process_instance_urn.DataProcessInstanceUrn object at 0x7fb9cecbb040>, orchestrator='airflow', cluster='prod', type='BATCH_AD_HOC', template_urn=<datahub.utilities.urns.data_job_urn.DataJobUrn object at 0x7fb9cecbbfa0>, parent_instance=None, properties={'run_id': 'manual__2023-03-07T16:56:07.937548+00:00', 'duration': '0.163779', 'start_date': '2023-03-07 16:56:31.157871+00:00', 'end_date': '2023-03-07 16:56:31.321650+00:00', 'execution_date': '2023-03-07 16:56:07.937548+00:00', 'try_number': '1', 'hostname': 'dbtdatahubend-dc6d51700abc41e0974b46caafd857ac', 'max_tries': '0', 'external_executor_id': 'None', 'pid': '19', 'state': 'success', 'operator': 'EmptyOperator', 'priority_weight': '1', 'unixname': 'airflow', 'log_url': 'https://app.dev.datafy.cloud/environments/datahubtest/airflow/log?execution_date=2023-03-07T16%3A56%3A07.937548%2B00%3A00&task_id=end&dag_id=dbt-datahub&map_index=-1'}, url='https://app.dev.datafy.cloud/environments/datahubtest/airflow/log?execution_date=2023-03-07T16%3A56%3A07.937548%2B00%3A00&task_id=end&dag_id=dbt-datahub&map_index=-1', inlets=[], outlets=[], upstream_urns=[])
   [2023-03-07, 16:56:31 UTC] {_plugin.py:191} INFO - Emitted Completed Data Process Instance: DataProcessInstance(id='dbt-datahub_end_manual__2023-03-07T16:56:07.937548+00:00', urn=<datahub.utilities.urns.data_process_instance_urn.DataProcessInstanceUrn object at 0x7fb9ced39700>, orchestrator='airflow', cluster='prod', type='BATCH_SCHEDULED', template_urn=<datahub.utilities.urns.data_job_urn.DataJobUrn object at 0x7fb9cecbbfa0>, parent_instance=None, properties={}, url=None, inlets=[], outlets=[], upstream_urns=[])
   [2023-03-07, 16:56:31 UTC] {local_task_job.py:159} INFO - Task exited with return code 0
   [2023-03-07, 16:56:31 UTC] {taskinstance.py:2623} INFO - 0 downstream tasks scheduled from follow-on schedule check
   ```
   
   
   ### What you think should happen instead
   
   I expect the behavior to be consistent: no matter where the EmptyOperator appears in your DAG, the same behavior should be observed (it is never processed by the executor).
   
   ### How to reproduce
   
   Create one DAG containing (a minimal sketch follows below):
   - a start EmptyOperator task
   - a random task in between (in our case a simple containerTask)
   - an end EmptyOperator task
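   
   For illustration, here is a minimal sketch of such a DAG; the DAG id and the BashOperator are hypothetical stand-ins for our actual DAG and containerTask:
   
   ```
   from datetime import datetime
   
   from airflow import DAG
   from airflow.operators.bash import BashOperator
   from airflow.operators.empty import EmptyOperator
   
   with DAG(
       dag_id="empty-operator-repro",  # hypothetical id; ours is dbt-datahub
       start_date=datetime(2023, 3, 1),
       schedule="@daily",
       catchup=False,
   ) as dag:
       start = EmptyOperator(task_id="start")
       # Stand-in for the containerTask we run in practice.
       work = BashOperator(task_id="work", bash_command="echo 'doing some work'")
       end = EmptyOperator(task_id="end")
   
       start >> work >> end
   ```
   
   Running this, the start task produces no executor logs, while the end task is actually executed by the executor.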
   
   ### Operating System
   
   kubernetes
   
   ### Versions of Apache Airflow Providers
   
   apache-airflow-providers-amazon==6.0.0
   apache-airflow-providers-celery==3.0.0
   apache-airflow-providers-cncf-kubernetes==4.0.2
   apache-airflow-providers-common-sql==1.3.3
   apache-airflow-providers-docker==3.2.0
   apache-airflow-providers-elasticsearch==4.2.1
   apache-airflow-providers-ftp==3.3.0
   apache-airflow-providers-google==8.4.0
   apache-airflow-providers-grpc==3.0.0
   apache-airflow-providers-hashicorp==3.1.0
   apache-airflow-providers-http==4.1.1
   apache-airflow-providers-imap==3.1.1
   apache-airflow-providers-microsoft-azure==4.3.0
   apache-airflow-providers-mysql==3.2.1
   apache-airflow-providers-odbc==3.1.2
   apache-airflow-providers-opsgenie==3.1.0
   apache-airflow-providers-postgres==5.2.2
   apache-airflow-providers-redis==3.0.0
   apache-airflow-providers-sendgrid==3.0.0
   apache-airflow-providers-sftp==4.1.0
   apache-airflow-providers-slack==4.2.3
   apache-airflow-providers-sqlite==3.3.1
   apache-airflow-providers-ssh==3.2.0
   
   ### Deployment
   
   Other Docker-based deployment
   
   ### Deployment details
   
   /
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


[GitHub] [airflow] eladkal commented on issue #29974: Inconsistent behavior of EmptyOperator between start and end tasks

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal commented on issue #29974:
URL: https://github.com/apache/airflow/issues/29974#issuecomment-1460995301

   @nclaeys the issue is fixed in https://github.com/apache/airflow/pull/29979
   Sadly it missed the cut for Airflow 2.5.2RC1, so it will have to wait for the 2.5.3 release.


[GitHub] [airflow] eladkal commented on issue #29974: Inconsistent behavior of EmptyOperator between start and end tasks

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal commented on issue #29974:
URL: https://github.com/apache/airflow/issues/29974#issuecomment-1460277521

   Cool, so now that we know the root cause, let's see if we can fix it.


[GitHub] [airflow] nclaeys commented on issue #29974: Inconsistent behavior of EmptyOperator between start and end tasks

Posted by "nclaeys (via GitHub)" <gi...@apache.org>.
nclaeys commented on issue #29974:
URL: https://github.com/apache/airflow/issues/29974#issuecomment-1460274670

   @eladkal Good hunch; indeed, if I disable the mini scheduler, the end task (EmptyOperator) is not executed by the executor and the behavior is the same as for the start task.


[GitHub] [airflow] eladkal closed issue #29974: Inconsistent behavior of EmptyOperator between start and end tasks

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal closed issue #29974: Inconsistent behavior of EmptyOperator between start and end tasks
URL: https://github.com/apache/airflow/issues/29974


[GitHub] [airflow] eladkal commented on issue #29974: Inconsistent behavior of EmptyOperator between start and end tasks

Posted by "eladkal (via GitHub)" <gi...@apache.org>.
eladkal commented on issue #29974:
URL: https://github.com/apache/airflow/issues/29974#issuecomment-1460168464

   Just a theory: I wonder if this is a result of the mini scheduler optimization. Maybe the mini scheduler does not consider the EmptyOperator case?
   Can you try to set [`schedule_after_task_execution = False`](https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#schedule-after-task-execution) and check if this still happens?
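   
   For reference, a sketch of how that flag can be set (assuming the standard `[scheduler]` config section and the usual environment-variable convention; adjust for your deployment):
   
   ```
   # airflow.cfg
   [scheduler]
   schedule_after_task_execution = False
   ```
   
   or, equivalently, export `AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION=False` in the environment where tasks run.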


[GitHub] [airflow] nclaeys commented on issue #29974: Inconsistent behavior of EmptyOperator between start and end tasks

Posted by "nclaeys (via GitHub)" <gi...@apache.org>.
nclaeys commented on issue #29974:
URL: https://github.com/apache/airflow/issues/29974#issuecomment-1465687972

   @eladkal Thanks a lot for your help! Looking forward to 2.5.3 then :wink:  

