Posted to dev@airflow.apache.org by Oleksandr Muliar <ol...@justeattakeaway.com> on 2020/11/12 17:21:33 UTC

[DISCUSS] DagFileProcessor - should it process multiple files per process?

Hello, everyone!

I hope this is the right place to ask about this; please redirect me
otherwise :)

I was looking into how DAG files are imported, and noticed that Airflow
creates a whole new process for each file that could potentially contain
DAGs, then closes the process after parsing only that single file.

It would seem to me that keeping the process around to parse multiple files
would be much more efficient (it would keep SQLAlchemy connections alive,
for example). Is there a specific reason this design was chosen, and if
not, is there any interest in changing it?
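
For illustration, a rough sketch of the two models, using multiprocessing
as a stand-in for Airflow's own processor manager (parse_dag_file and the
paths here are placeholders, not the real DagFileProcessor API):

    import glob
    import multiprocessing

    def parse_dag_file(path):
        # Placeholder for the real per-file work: import the module,
        # collect any DAG objects, and sync them to the metadata DB.
        print(f"parsing {path}")

    if __name__ == "__main__":
        dag_file_paths = glob.glob("dags/**/*.py", recursive=True)

        # Current model (simplified): one short-lived process per file,
        # so interpreter startup and DB connection setup are paid per file.
        for path in dag_file_paths:
            p = multiprocessing.Process(target=parse_dag_file, args=(path,))
            p.start()
            p.join()

        # Proposed model: a few long-lived workers, each parsing many
        # files, so that setup cost is paid once per worker instead.
        with multiprocessing.Pool(processes=4) as pool:
            pool.map(parse_dag_file, dag_file_paths)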

My initial reason for looking into this is that DagBag filling time
seems to be rather slow when we have a significant number of DAG files
(more than a thousand).
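
For reference, the fill time can be measured directly with the DagBag
class (a minimal sketch; the dags/ path is our own):

    import time
    from airflow.models import DagBag

    start = time.perf_counter()
    dagbag = DagBag(dag_folder="dags/", include_examples=False)
    elapsed = time.perf_counter() - start

    print(f"filled DagBag with {len(dagbag.dags)} DAGs in {elapsed:.1f}s")
    print(f"{len(dagbag.import_errors)} files failed to import")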

Regards,
Oleksandr


Re: [DISCUSS] DagFileProcessor - should it process multiple files per process?

Posted by Oleksandr Muliar <ol...@justeattakeaway.com>.
Hi Ash,

Thanks for the suggestion! Great to hear that the scheduling issue should
be resolved now.

I still want to look into DAG file parsing optimization, as it feels like
even after AIP-15 the delay between a DAG directory update and the changes
being reflected in the DB can be significant with a large number of DAG
files (1,500+ in our case).
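
A back-of-envelope for the delay we are seeing (all numbers illustrative,
not measured figures):

    # With N files and a fixed pool of parser processes, a change to one
    # file waits, in the worst case, for a full pass over the directory.
    num_files = 1500
    parsing_processes = 2    # [scheduler] parsing_processes in Airflow 2.0
    seconds_per_file = 0.5   # assumed average parse time per file

    worst_case_delay = num_files * seconds_per_file / parsing_processes
    print(f"~{worst_case_delay:.0f}s before a change reaches the DB")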

As you suggest, though, I will measure on the 2.0 beta first.

Thank you,
Oleksandr



Re: [DISCUSS] DagFileProcessor - should it process multiple files per process?

Posted by Ash Berlin-Taylor <as...@apache.org>.
Hi Oleksandr,

So, not to short-circuit the discussion, but with the HA work I did (AIP-15), which is available in 2.0.0beta2, the scheduler has been massively overhauled. One of the changes was to break the tie between parsing and scheduling: the scheduler now operates on the serialised version of each DAG from the DB.

So DAG file parsing time is much less of a limitation.
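
Concretely, the scheduler side now does something like this (a sketch based
on the serialised-DAG model in the 2.0 betas; the dag_id is hypothetical
and the exact API may still change):

    from airflow.models.serialized_dag import SerializedDagModel

    # No .py files are imported here: the parser subprocesses write a
    # JSON-serialised copy of each DAG to the DB, and the scheduler
    # reads that copy back when making scheduling decisions.
    sdm = SerializedDagModel.get("example_dag_id")
    if sdm is not None:
        dag = sdm.dag  # the deserialised DAG object
        print(dag.dag_id, len(dag.tasks))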

But some "max_dag_files_per_parser" setting may help, but I'd see if 2.0 fixes your performance issues first.

-Ash

