You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2019/11/18 20:59:32 UTC

[GitHub] [airflow] potiuk commented on a change in pull request #6596: [AIRFLOW-6004] Untangle Executors class to avoid cyclic imports. Depends on [AIRFLOW-6010]

potiuk commented on a change in pull request #6596: [AIRFLOW-6004] Untangle Executors class to avoid cyclic imports. Depends on [AIRFLOW-6010]
URL: https://github.com/apache/airflow/pull/6596#discussion_r347606197
 
 

 ##########
 File path: tests/dags/test_subdag.py
 ##########
 @@ -24,7 +24,7 @@
 
 from datetime import datetime, timedelta
 
-from airflow.models import DAG
+from airflow.models.dag import DAG
 
 Review comment:
   I think the main reason for our disagreement here is because (in this case) I think more about contributors (especially first-time ones) and you think more about DAG developers. This is the old "library" vs "application" thing. And we are both kind of right (but with different users in mind).
   
   I agree that users should have a stable place to import DAG and AirflowException from (in this view airflow is a library), while it confuses developers (and might easily lead to circular imports).
   
   Let's try to find some approach here that will be good for both. I split the discussion into several smaller subjects to make it more productive.
   
   ## Finding imports
   
   I know not everyone ;) likes IDEs, but a lot of people does. This is what it looks like when I try to (in PyCharm) import AirflowException:
   
   Before my change:
   
   ![Screenshot 2019-11-18 at 20 58 19](https://user-images.githubusercontent.com/595491/69085738-6c594680-0a46-11ea-9da0-a6356a575b5f.png)
   
   And after:
   
   ![Screenshot 2019-11-18 at 20 59 01](https://user-images.githubusercontent.com/595491/69086124-84c96100-0a46-11ea-9950-786d76d0efaa.png)
   
   The class that throws deprecation warning is not even shown as importable and I know exactly which class I have to import and i even get a chance to import it locally. Same with DAG.  It's a nuance and not something must-have, but it's nice.
   
   ## Result of the current situation
   
   I can assure that myself (and many other people) got confused at this point - which is the place we should import DAG from. And it's hard data, not guessing: 
   
   ```
   [potiuk:~/code/airflow] remove-dag-out-of-airflow-package-import+ 3d6h54m19s ± git show HEAD | grep 'from airflow import DAG' | grep ^- | wc
         80     346    2219
   [potiuk:~/code/airflow] remove-dag-out-of-airflow-package-import+ 3d6h54m23s ± git show HEAD | grep 'from airflow.models import DAG' | grep ^- | wc
         97     507    4233
   ```
   We have now 80 places where someone used `from airflow import DAG` and 97 where people used `from airflow.models import DAG`.  Maybe we have some rules, but they are neither documented nor enforced.
   
   Can you tell  which way is better? Are you sure people know consequences of using them and the circular imports they might create this way ? I really, really doubt this. As a new user I cannot reason about the distinction. I am confused why those different import path exist. I have no idea which import should I use in my code. Seems like people randomly choose the import they should use without knowing the consequences. Certainly from an "application/contributors" point of view it's not good.
   
   ## DAG developer's perspective
   
   > I'm not a fan of this as it would require almost every single DAG that exists right now to be changed.
   
   It does not require them to change - it's merely deprecation warning. But indeed it encourages them to use the models.dag package rather than airflow.DAG. And yes - I agree that having a single place to import for is good for airflow-as-a-library but not necessary good for airflow-as-an-application . 
   
   So maybe we should do something else instead. Mabe the deprecation warning should only be thrown if you are importing airflow from within airflow core code itself? I will certainly look for a solution here - maybe you can also think of something? And it would be ideal the other way round - if the DAG developers were discouraged to import the airflow.models.dag DAG.
   
   And thinking more about it in the context of AIP-21 - I think those circular imports are only a problem of the "core" of airflow and that the "providers" part should be treated differently. Especially that we have to be careful about backporting. Since we plan all the providers package to be 1.10.* installable, they should use airflow.DAG likely as import. 
   
   I hope you can agree with me that it makes sense for Airflow "core" in airflow repository use a single, direct import (airflow.models.dag.DAG) where circular imports will be least likely. And then "providers" might use the "airflow.DAG" for backwards compatibility (and serving as model for other integrations). Then I yet have to see if I can make deprecation works in the way I want.
   
   I think pre-commits + AIP-21 are a good place to start - we could enforce different rules for different parts of the code in fact and educate people this way.
   
   WDYT @ashb?
   
   ## Avoiding cyclic imports 
   
   > We can manage the cyclic imports by explicitly moving some imports to the end of airflow/__init__.py
   
   Which ones? Do you know by heart? I am utterly confused when it comes to the sequence of imports there and I always have to go through many files and imports to understand what's going on -> airflow imports executors -> import kubernetes executor -> import pod -> import pod launcher -> import baseoperator -> import dag  -> import airflow ....... Good import structure should be in fact a .... DAG (!) .... Only then you can start reasoning about it :). 
   
   ## Importing 'airflow' package first 
   
   > Once import airflow has happened then all the packages are loaded an import order doesn't matter, so we can handle this all "internally" (which we already do via our unit tests). If it's working right now, then no matter what order it is imported by the user no import cycles are possible is my understanding.
   
   But I think we should really be able to not airflow first. I had really hard time trying to fix and correct some tests precisely because of the "import airflow" first. That's an anti-pattern especially that in tests we had to reload the configuration as it has already been loaded and we had to reload it. The way we run it now is really bad for tests.
   
   There is some initialisation magic happening in __init__ which then requires you to reload configuration in a specific way and some tests require this, some not. I know also @nuclearpinguin had a lot of problems trying to fix it in pytest context (and I had it for a long time at many opportunities). 
   
   In an ideal world there should be no requirement to import airflow first (eventually). We should be able to run some initialisation method or create an object to bootstrap airflow but importing airflow should not have this initialisation side-effect (it's really a side-effect).
   
   Running code in __init__.py is, I think, good for library (when you have to perform certain initialisations with however loads the library) but it's not really good for application such as airflow where you have a few clear entry-points and you can run the initialisation explicitly.
   
   I hope that one day we will get there - and this import restructuring might be one of the steps that could get us there.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services