Posted to dev@airflow.apache.org by Chen Michaeli <pe...@gmail.com> on 2020/09/21 21:23:19 UTC

Question regarding data usage in the DAG itself

Hello, I am using Apache Airflow for fun and experience, and it is great!
I hope this is the right address for questions; please correct me if I'm
wrong.

I was wondering: why shouldn't I let the DAG itself do any data gathering?

For example, and for the sake of simplicity, say I have a pipeline that
reads a file name from an S3 bucket and then stores it in a MySQL table.

Normally I would use one sensor or operator to get the file name, and then
a second operator to store it in MySQL, using, for example, XCom to
communicate the name between them. Roughly like the sketch below.
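
For concreteness, a minimal sketch of that pattern (the bucket name, DAG id,
and task names are made up, and I'm assuming Airflow 1.10-style imports, so
paths may differ on other versions):

    import boto3
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def get_file_name(**context):
        # Task 1: find the newest key in the bucket; the return value
        # is pushed to XCom automatically by PythonOperator.
        s3 = boto3.client("s3")
        objects = s3.list_objects_v2(Bucket="my-example-bucket")["Contents"]
        return max(objects, key=lambda o: o["LastModified"])["Key"]

    def store_file_name(**context):
        # Task 2: pull the name from XCom and insert it into MySQL
        # (insert logic omitted; e.g. via MySqlHook).
        file_name = context["ti"].xcom_pull(task_ids="get_file_name")
        print("would store %s in MySQL" % file_name)

    with DAG("s3_to_mysql_example",
             start_date=datetime(2020, 9, 1),
             schedule_interval="@daily") as dag:
        get_name = PythonOperator(task_id="get_file_name",
                                  python_callable=get_file_name,
                                  provide_context=True)
        store_name = PythonOperator(task_id="store_file_name",
                                    python_callable=store_file_name,
                                    provide_context=True)
        get_name >> store_name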

I understand this might be the preferred course of action, and it is what
I currently do!
However, what I don't understand is why I can't just get the file name
within the DAG file itself.
Why is it considered bad practice to do any data-related processing or
gathering at the DAG level?

I could use the AWS API to easily retrieve the file name and store it in a
regular Python "global" variable. Then I would need only one operator that
takes this file name and stores it in MySQL, roughly like the sketch below.
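
A hypothetical sketch of what I have in mind (same made-up names as above;
I realize this is top-level code):

    import boto3
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    # Top-level code: this runs every time this file is parsed,
    # not only when the DAG actually executes.
    s3 = boto3.client("s3")
    objects = s3.list_objects_v2(Bucket="my-example-bucket")["Contents"]
    FILE_NAME = max(objects, key=lambda o: o["LastModified"])["Key"]

    def store_file_name():
        # The single remaining task: FILE_NAME was already resolved
        # when the file was parsed.
        print("would store %s in MySQL" % FILE_NAME)

    with DAG("s3_to_mysql_toplevel",
             start_date=datetime(2020, 9, 1),
             schedule_interval="@daily") as dag:
        PythonOperator(task_id="store_file_name",
                       python_callable=store_file_name)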

Each time the DAG is parsed for execution, my code that uses the AWS API
will run again and provide me with a fresh file name.

Am I missing something?

Thank you very much, this has gotten me so curious!

Re: Question regarding data usage in the DAG itself

Posted by Chen Michaeli <pe...@gmail.com>.
Oh, I thought the DAG was parsed only prior to execution.

Thank you so much! :)


Re: Question regarding data usage in the DAG itself

Posted by Tomasz Urbaszek <tu...@apache.org>.
Chen, take a look at the `processor_poll_interval` and
`min_file_process_interval` options in the Airflow configuration. Still, I
would strongly recommend removing any executable top-level code from your
DAGs.
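
For example, in airflow.cfg (the values here are only illustrative, and the
defaults differ between Airflow versions; both options live in the
[scheduler] section):

    [scheduler]
    # Seconds the scheduler sleeps between scheduling loops.
    processor_poll_interval = 5
    # Minimum seconds between re-parses of the same DAG file.
    min_file_process_interval = 300

You can also set them via environment variables, e.g.
AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=300.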

Cheers,
Tomek


Re: Question regarding data usage in the DAG itself

Posted by Chen Michaeli <pe...@gmail.com>.
Hi, a quick follow-up.

Is there a parameter I can configure to alter that behavior?

Say I want a specific DAG, or all DAGs, to be parsed every X minutes
instead of the default few seconds?

Thanks again :)


Re: Question regarding data usage in the DAG itself

Posted by Tomasz Urbaszek <tu...@apache.org>.
The DAG file is parsed every few seconds by the scheduler, which means
that any top-level code is executed every few seconds. So if you call an
external API or database at the DAG level (not inside an operator), the
request will be sent quite often, and that's definitely not the expected
behavior :)
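
To illustrate with a hypothetical example, the difference is simply where
the call sits:

    import requests  # stand-in for any external API client

    # BAD: top level -- runs on every scheduler parse, every few seconds.
    latest = requests.get("https://api.example.com/latest").json()

    def fetch_latest():
        # GOOD: inside a callable -- runs only when the task executes.
        return requests.get("https://api.example.com/latest").json()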

Cheers,
Tomek
