Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/11/11 22:29:25 UTC

[GitHub] [airflow] GagandeepS opened a new issue #19548: Option to include subdir in trigger_dag to not make scheduler scan the whole dag folder

GagandeepS opened a new issue #19548:
URL: https://github.com/apache/airflow/issues/19548


   ### Description
   
   We have a design in which tens of child DAGs are created every few minutes and triggered from a parent DAG. I believe that, while triggering each child DAG, the scheduler searches the whole DagBag, which slows the whole process down. We run TriggerDagRunOperator/trigger_dag in a while loop: if the trigger succeeds, the loop exits; otherwise it triggers again.
   
   To reduce the load on the scheduler, I believe there should be a way to supply a subdir to TriggerDagRunOperator so that the scheduler searches for the dag_id only inside that folder instead of the whole DagBag.
   
   ### Use case/motivation
   
   We have a parent DAG that triggers multiple child DAGs, and we want to reduce the time the scheduler takes to discover each child DAG.
   
   ### Related issues
   
   Discussion: https://github.com/apache/airflow/discussions/19547
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] GagandeepS edited a comment on issue #19548: Option to include subdir in trigger_dag to not make scheduler scan the whole dag folder

Posted by GitBox <gi...@apache.org>.
GagandeepS edited a comment on issue #19548:
URL: https://github.com/apache/airflow/issues/19548#issuecomment-968409201


   We have a use case where multiple dynamic DAGs are added to the DagBag, and I believe there will always be some latency between dropping a new DAG into the DagBag folder and the operator, via trigger_dag, finding the path/record of that new DAG in the table.
   
   So, to give the scheduler as much time as it needs to insert the record into the table, we trigger the new DAG and check whether it was triggered without error. If there is an error (usually 'Dag xxx does not exists'), we retry after some time. So far so good, except under peak load, when tens of DAGs are generated dynamically and saved into the DagBag. In that case the scheduler slows down because it needs to insert multiple records, and hence trigger_dag (because of the retries) takes 3-10 min. I want to minimize this 3-10 min.
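
The retry pattern described above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: `trigger_with_retry`, the local `DagNotFound` class, and the delay values are all assumptions made so that the example is self-contained. In a real deployment the `trigger` callable would presumably wrap `trigger_dag` from `airflow/api/common/experimental/trigger_dag.py`, and the caught exception would be whatever that call raises when the DAG is not yet registered.

```python
import time
from typing import Callable


class DagNotFound(Exception):
    """Hypothetical stand-in for the 'DAG does not exist' error."""


def trigger_with_retry(
    trigger: Callable[[str], object],
    dag_id: str,
    max_attempts: int = 10,
    base_delay: float = 5.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
):
    """Call trigger(dag_id) until it succeeds or max_attempts is reached."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return trigger(dag_id)
        except DagNotFound:
            if attempt == max_attempts:
                raise  # give up: the DAG never showed up in the table
            sleep(delay)
            # Double the wait (up to max_delay) so that retries add less
            # load while the scheduler is busy registering new DAGs.
            delay = min(delay * 2, max_delay)
```

Bounding the number of attempts and backing off between them is a design choice of this sketch; an unbounded loop with a fixed delay behaves the same on success but keeps polling at a constant rate during exactly the peak-load periods described above.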
   
   Proposed solution: potentially, either add a table to the Airflow backend data model, or use an index, bulk insert, or similar, so that the scheduler's performance is not hampered while inserting the new records and looking up the new DAG gets faster.
   
   Just a thought: maybe Airflow could allow changing the type of its database, so that instead of Postgres one could use a NoSQL store with indexes matching those in Postgres right now (I am hoping), making inserts and searches faster at the cost of not-so-fast updates.





[GitHub] [airflow] mik-laj commented on issue #19548: Option to include subdir in trigger_dag to not make scheduler scan the whole dag folder

Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #19548:
URL: https://github.com/apache/airflow/issues/19548#issuecomment-967714463


   As far as I can see in the code, this operator looks up the path to the DAG file in the database and loads only that one file. I don't know what you would like to optimize here.
   https://github.com/apache/airflow/blob/37a12e9c278209d7e8ea914012a31a91a6c6ccff/airflow/api/common/experimental/trigger_dag.py#L117
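
The lookup described in this comment can be modelled with a toy example. It uses an in-memory SQLite table in place of Airflow's metadata database; the simplified table layout, the function names, and the file paths are illustrative assumptions and not Airflow's actual schema or code. The point it tries to show is that resolving a `dag_id` is a single keyed lookup that yields one file path, and that a "not found" error simply means the row has not been written yet.

```python
import sqlite3


class DagNotFound(Exception):
    """Hypothetical stand-in for the error raised when a dag_id has no row."""


def make_metadata_db() -> sqlite3.Connection:
    """Create a simplified, in-memory stand-in for the metadata database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dag (dag_id TEXT PRIMARY KEY, fileloc TEXT NOT NULL)"
    )
    return conn


def register_dags(conn: sqlite3.Connection, rows) -> None:
    """Record where each DAG was found (conceptually, the parser's job)."""
    conn.executemany(
        "INSERT OR REPLACE INTO dag (dag_id, fileloc) VALUES (?, ?)", rows
    )
    conn.commit()


def resolve_fileloc(conn: sqlite3.Connection, dag_id: str) -> str:
    """Return the single file that defines dag_id, or raise DagNotFound."""
    row = conn.execute(
        "SELECT fileloc FROM dag WHERE dag_id = ?", (dag_id,)
    ).fetchone()
    if row is None:
        raise DagNotFound(f"Dag id {dag_id} not found")
    return row[0]


conn = make_metadata_db()
register_dags(conn, [("child_1", "/dags/generated/child_1.py")])
print(resolve_fileloc(conn, "child_1"))
```

Under this model, restricting the search to a subdirectory would not shorten the lookup itself, since only one file is ever loaded; the wait comes from the row not existing until the new file has been parsed and registered.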





[GitHub] [airflow] boring-cyborg[bot] commented on issue #19548: Option to include subdir in trigger_dag to not make scheduler scan the whole dag folder

Posted by GitBox <gi...@apache.org>.
boring-cyborg[bot] commented on issue #19548:
URL: https://github.com/apache/airflow/issues/19548#issuecomment-966671171


   Thanks for opening your first issue here! Be sure to follow the issue template!
   

