Posted to users@airflow.apache.org by Reed Villanueva <rv...@ucera.org> on 2020/01/30 05:13:46 UTC

How often is dag definition file read during a single dag run?

How often is a dag definition file read during a single dag run?

I have a large DAG that takes a long time to build (~1-3 min). Looking at the
logs of each task as the DAG is running, it appears that the DAG definition
file is being executed for every task before it runs...

*** Reading local file: /home/airflow/airflow/logs/mydag/mytask/2020-01-30T04:51:34.621883+00:00/1.log
[2020-01-29 19:02:10,844] {taskinstance.py:655} INFO - Dependencies all met for <TaskInstance: mydag.mytask 2020-01-30T04:51:34.621883+00:00 [queued]>
[2020-01-29 19:02:10,866] {taskinstance.py:655} INFO - Dependencies all met for <TaskInstance: mydag.mytask 2020-01-30T04:51:34.621883+00:00 [queued]>
[2020-01-29 19:02:10,866] {taskinstance.py:866} INFO -
--------------------------------------------------------------------------------
[2020-01-29 19:02:10,866] {taskinstance.py:867} INFO - Starting attempt 1 of 1
[2020-01-29 19:02:10,866] {taskinstance.py:868} INFO -
--------------------------------------------------------------------------------
[2020-01-29 19:02:10,883] {taskinstance.py:887} INFO - Executing <Task(BashOperator): precheck_db_perms> on 2020-01-30T04:51:34.621883+00:00
[2020-01-29 19:02:10,887] {standard_task_runner.py:52} INFO - Started process 140570 to run task
[2020-01-29 19:02:11,048] {logging_mixin.py:112} INFO - [2020-01-29 19:02:11,047] {dagbag.py:403} INFO - Filling up the DagBag from /home/airflow/airflow/dags/mydag.py
[2020-01-29 19:02:11,052] {logging_mixin.py:112} INFO - <output from my dag definition file>
[2020-01-29 19:02:11,101] {logging_mixin.py:112} INFO - <more output from my dag definition file>
....
[2020-01-29 19:02:58,651] {logging_mixin.py:112} INFO - Running %s on host %s <TaskInstance: mydag.mytask 2020-01-30T04:51:34.621883+00:00 [running]> airflowetl.co.local
[2020-01-29 19:02:58,674] {bash_operator.py:81} INFO - Tmp dir root location:
 /tmp
[2020-01-29 19:02:58,674] {bash_operator.py:91} INFO - Exporting the following env vars:
AIRFLOW_CTX_DAG_EMAIL=me@co.org
AIRFLOW_CTX_DAG_OWNER=me
AIRFLOW_CTX_DAG_ID=mydag
AIRFLOW_CTX_TASK_ID=mytask
AIRFLOW_CTX_EXECUTION_DATE=2020-01-30T04:51:34.621883+00:00
AIRFLOW_CTX_DAG_RUN_ID=manual__2020-01-30T04:51:34.621883+00:00
[2020-01-29 19:02:58,675] {bash_operator.py:105} INFO - Temporary script location: /tmp/airflowtmphwu1ckty/mytaskbmnsizw5
<only now does the actual task logic output seem to start>

where the first whole part of the log seems to imply that the DAG file is
being executed each time a new task is run (I see this for every task).

Is this indeed what is happening here? Is this normal / expected behavior?
Note that since my DAG takes some time to build, that build time is being
multiplied across every task in the DAG (of which there are many in this
case), which makes me think this is either not normal or there is some best
practice I am not using here. Could anyone with more Airflow experience help
explain what I'm seeing here?


Re: How often is dag definition file read during a single dag run?

Posted by Reed Villanueva <rv...@ucera.org>.
Interesting, thanks for the info.

On Thu, Jan 30, 2020 at 11:55 AM Shaw, Damian P. <
damian.shaw.2@credit-suisse.com> wrote:

> It’s not that the task cares about the DAG structure; it’s that tasks only
> exist as part of the DAG. The only way to get a task in Airflow is to get
> the DAG first, at least as far as I know.
>
>
>
> Damian


RE: How often is dag definition file read during a single dag run?

Posted by "Shaw, Damian P. " <da...@credit-suisse.com>.
It’s not that the task cares about the DAG structure; it’s that tasks only exist as part of the DAG. The only way to get a task in Airflow is to get the DAG first, at least as far as I know.

Damian
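
[Editor's note: for illustration, a minimal sketch of roughly what a worker process has to do to obtain a task in Airflow 1.10: fill a DagBag (which executes the DAG definition file), fetch the DAG, then fetch the task from it. The folder, DAG id, and task id below are taken from the log earlier in this thread.]

from airflow.models import DagBag

# Filling the DagBag parses /home/airflow/airflow/dags/mydag.py, i.e. the
# whole definition file runs again; this is the per-task cost being discussed.
dag_bag = DagBag(dag_folder="/home/airflow/airflow/dags")
dag = dag_bag.get_dag("mydag")

# Tasks only exist as part of the parsed DAG object.
task = dag.get_task("mytask")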



Re: How often is dag definition file read during a single dag run?

Posted by Chris Palmer <ch...@crpalmer.com>.
Fundamentally, tasks are defined by the code itself, and so a worker process
can only determine what code to execute at run time by parsing the Python
code that defines it. Maybe there are some cases where a task could be well
defined outside of the full context of the DAG that contains it, but that
doesn't apply in general.

Chris


Re: How often is dag definition file read during a single dag run?

Posted by Reed Villanueva <rv...@ucera.org>.
Thanks for the clarification.

Yes, ultimately pickling the configs I use to build the graph was what I did
(the graph is created in a loop that does DB queries to make DAG branches for
a set of tables, so queries are involved and were causing overhead because the
queries were being run for every single task, when they only really needed to
run once to make the config dict for the DAG).
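
[Editor's note: a minimal sketch of this config-caching approach, under assumptions; the cache path, max age, and the query_table_configs() stub are hypothetical, not taken from the actual DAG.]

import os
import pickle
import time

CONFIG_CACHE = "/home/airflow/airflow/dags/.mydag_config.pkl"  # hypothetical location
MAX_CACHE_AGE = 60 * 60  # seconds; assumption about how fresh the config must be

def query_table_configs():
    # Placeholder for the real DB queries that build the per-table config dict.
    return {"example_table": {"schema": "public"}}

def load_table_configs():
    # Return the config dict used to build the DAG branches, hitting the DB
    # only when the on-disk cache is missing or stale.
    if os.path.exists(CONFIG_CACHE) and time.time() - os.path.getmtime(CONFIG_CACHE) < MAX_CACHE_AGE:
        with open(CONFIG_CACHE, "rb") as f:
            return pickle.load(f)
    configs = query_table_configs()
    with open(CONFIG_CACHE, "wb") as f:
        pickle.dump(configs, f)
    return configs

# The DAG-building loop then iterates over the cached dict instead of querying
# the database every time the definition file is parsed:
# for table, conf in load_table_configs().items():
#     ...build the branch of tasks for this table...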

Could you explain a bit as to why any task would even care about the DAG
structure? I would think that if the scheduler sets them to run, then they
should run and have no reason to care about the overall structure of the
DAG.



RE: How often is dag definition file read during a single dag run?

Posted by "Shaw, Damian P. " <da...@credit-suisse.com>.
Yes, every task is run in process isolation (and could be running across separate machines), so every task builds the DAG from scratch.

If you don’t expect your DAG to change over a given window of time and the tasks run on the same machine, you could cache / pickle the DAG object and, before trying to build the DAG, check whether the cache / pickle file is available and recent and load it from there. I'm sure there are many other solutions as well.

Damian
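
[Editor's note: a minimal sketch of the cache-or-build pattern described above. It assumes the DAG and its operators pickle cleanly, which is not guaranteed for every operator; the cache path and 15-minute window are made up.]

import os
import pickle
import time
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

DAG_CACHE = "/tmp/mydag.pkl"  # hypothetical cache location
MAX_AGE = 15 * 60             # rebuild if the cached DAG is older than 15 minutes

def build_dag():
    # The expensive ~1-3 minute construction would live here.
    dag = DAG("mydag", start_date=datetime(2020, 1, 1), schedule_interval=None)
    BashOperator(task_id="precheck_db_perms", bash_command="echo check", dag=dag)
    return dag

if os.path.exists(DAG_CACHE) and time.time() - os.path.getmtime(DAG_CACHE) < MAX_AGE:
    with open(DAG_CACHE, "rb") as f:
        dag = pickle.load(f)  # skip the expensive build in this task process
else:
    dag = build_dag()
    with open(DAG_CACHE, "wb") as f:
        pickle.dump(dag, f)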

