You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "potiuk (via GitHub)" <gi...@apache.org> on 2023/02/20 17:09:12 UTC

[GitHub] [airflow] potiuk commented on a diff in pull request #28256: Include full path to Python files under zip path while clearing import errors.

potiuk commented on code in PR #28256:
URL: https://github.com/apache/airflow/pull/28256#discussion_r1112190405


##########
airflow/dag_processing/manager.py:
##########
@@ -782,7 +782,11 @@ def clear_nonexistent_import_errors(file_paths: list[str] | None, session=NEW_SE
         """
         query = session.query(errors.ImportError)
         if file_paths:
-            query = query.filter(~errors.ImportError.filename.in_(file_paths))
+            for file_path in file_paths:
+                if file_path.endswith(".zip"):
+                    query = query.filter(~(errors.ImportError.filename.startswith(file_path)))
+                else:
+                    query = query.filter(errors.ImportError.filename != file_path)

Review Comment:
   Aren't we risking very long queries here @tirkarthi ? Potentially there can be many of those) . I wonder If there are limits in SQLalchemy / converting to SQL when there are thousands of AND conditions. Is it possible to test it on a fake DAG  folder with say 10.000 DAG files (mysql/postgres)?
   
   For example this one - for SQL Server: https://stackoverflow.com/questions/1869753/maximum-size-for-a-sql-server-query-in-clause-is-there-a-better-approach suggests that IN (a,b,c) is nothing comparing to chained conditions and that the latter might easily fail due to stack size. It also suggests that nowadays stack sizes are deep, but I have a hunch that running huge query like that might be terribly expensive:
   
   > Other than that, your query is limited by runtime conditions. It will usually run out of stack size because x IN (a,b,c) is nothing but x=a OR x=b OR x=c which creates an expression tree similar to x=a OR (x=b OR (x=c)), so it gets very deep with a large number of OR. SQL 7 would hit a SO [at about 10k values in the IN](http://support.microsoft.com/kb/288095), but nowdays stacks are much deeper (because of x64), so it can go pretty deep.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org