You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2020/12/10 12:22:07 UTC

[GitHub] [airflow] potiuk opened a new issue #12983: Add support for the new selective docs building in CI and Breeze `build-docs` command

potiuk opened a new issue #12983:
URL: https://github.com/apache/airflow/issues/12983


   Thanks to #8874 we have now the possibility of selectively building docs, we should add support for it for our CI builds selective checks and to add support (or simply add help message for that). This will immensely speed up the CI builds for most of the PRs because we can build only documentation for Providers if particular providers changed and for core only when core doc changes.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #12983: Add support for the new selective docs building in CI and Breeze `build-docs` command

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-742497222


   More like documenting it in the docs and help (I suspected it works). And for CI it's a bit more involved. we ned to figure the right flags looking at the list of changes coming in the PR


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on issue #12983: Add support for the new selective docs building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-742515225


   This is a bit dangerous as one documentation package can depend on another, so if we are not building all packages there is a chance that building one package will cause problems in another package and we will not see it before merging.
   
   I think it might be a good idea to wait a while while we fix other minor teething problems and then keep on adding optimizations. This task will also be easier if we finish this ticket: https://github.com/apache/airflow/issues/12282
   
   For now, our tasks are still queued most of the time, as other projects use much more resources.  Optimizing these projects can be of greater benefit to all.
   
   Is there any reason why this optimization can't wait? In addition, of course, the passion for continuous improvement for our CI. ❤️ 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on issue #12983: Add support for the new selective docs building in CI and Breeze `build-docs` command

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-742492732


   You can already do `./breeze build-docs -- --package-filter apache-airflow` -- are you thinking that we remove the need for the `--` and add things to completion here?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #12983: Add support for the new selective docs building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-742523162


   > For now, our tasks are still queued most of the time, as other projects use much more resources. Optimizing these projects can be of greater benefit to all.
   
   BTW. Unlike speeding our builds, we have very limited impact there. We are already involved in the build meetings from the INFRA and we started together stats of other projects. If you have some ideas how we can do it differently - feel free to propose it (possibly in the builds@ discussion list). 
   
   I think we should focus on what we CAN do.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #12983: Add support for the new selective docs building in CI and Breeze `build-docs` command

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-742497222


   More like documenting it in the docs. and help (I suspected it works). And for CI it's a bit more involved. we ned to figure the right flags looking at the list of changes coming in the PR


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #12983: Add support for the new selective docs building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-761826284


   Fine for me: I'd say then we need (4). Include .rst cross-references.  
   
   Let's build the dependencies including documentation and generate a minimum set of "packages" that makes sense for it. Looking at your example, this should be as easy as finding all `~airflow.providers` references and we can build such a graph. This means that if for example one commits a change to "singularity" provider which is not referred anywhere else, it should only build docs (and tests) for that provider only.
   
   Does it make sense? I think this will be much easier to reason about than to hardcode core - + 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-762303928


   > What do you mean by "unlimited" ?
   
   Here is a list of examples, some are incorrect but most should work.
   https://gist.github.com/mik-laj/9b5417532ee682a6f871ee90ae5869c0
   
   > And any time we found we missed one and it caused faillure we can find it and fix it if we find it caused problem (and incrementally get better at catching such cases). We will not get it perfect the first time, but it's fine. We will catch it an fix it, at the same time vast majority of the build will benefit from that.
   
   This is what I want to avoid. I would like to focus more on contributing to the project or to project documentation, rather than contributing to the contributing tools, unless absolutely necessary. A very small group of people wants to focus on contributing to the tools for contribution.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #12983: Add support for the new selective docs building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-742520793


   I do not think it is "we must do it now" :). However, doc building takes now 23 (!) minutes on CI, so especially in the case of the "no-tests needed" builds that do not run any of the unit tests, it is definitely the one that stands out.
   
   I think it has very little to do with #12282 to be honest because selective checks are NOT part of Breeze. They are fully separated and they have nothing to do with Breeze at all. The selective checks could have been rewritten in python as part of implementing the 'selective docs'. That might be actually a very good idea. 
   
   And I agree it is a bit dangerous, but so the selective tests are (however we always have the master builds after merge that run all the tests and docs and this should remain after this change) 
   
   We can make the logic a bit more complex if we rewrite it in python to make some protection against it (for example check very quickly if there are no :doc: or other references to the provider package that has just been changed.
   
   Just saying - it's a great opportunity to both - get some real benefits (in terms of feedback time and CI job strain) and to do it in a 'future-proof' way (i.e. python).
   
   WDYT?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-762261729


   >  I'd argue changing only documentation is a bad smell.
   
   There are several types of documentation:
   
   * Tutorials
   * how-to guides
   * Concepts
   * Reference
   
   (source: https://www.polidea.com/blog/how-open-source-guide-new-contributors-maintainers/#document-the-code)
   
   Only in the case of reference documentation, it is a "bad smell", but other types of documentation very often arise as independent contributions, because it takes a different kind of skill.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil edited a comment on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
kaxil edited a comment on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-762443795


   Again, 100% agree to that @potiuk -- Thanks to you leading that effort we are in a much better shape.
   
   But for the issue at hand about the documentation optimisation was what we or at least I was talking about. Would definitely love to not break Master for that. As I said, I am happy with (3) but we need to be careful with trying to optimize where not strictly needed as I mentioned in the previous comment.
   
   
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk closed issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk closed issue #12983:
URL: https://github.com/apache/airflow/issues/12983


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-762319487


   > improve your tools,
   
   This is a very subjective word. For some, it is important that the tools are stable, reliable and as simple as possible. For others, it is more important that they are fast, efficient, feature-full, but this can create difficulties with stability and reliability and add more complexity.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil commented on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
kaxil commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-762443795


   Again, 100% agree to that @potiuk -- Thanks to you leading that effort we are in a much better shape.
   
   But for the issue at hand about the documentation optimisation was what we or at least I was talking about. Would definitely love to not break Master for that. As I said, I am happy with (3) but we need to be careful with trying to optimize where not strictly needed as I mentioned in the previous comment.
   
   All the good work you and everyone who contributed to the CI is very very much appreciated by everyone but I think we should now focus on stability and the core product.
   
   The number of people who understands the current CI system is in single digits and it is very easy to put a blind eye on that one too because you are working on it. That is not sustainable in the long run.
   
   Because it is evolving quickly -- it is difficult to understand to keep track of the changes -- I would really really stress on the fact that we need more stability and more people knowing the current CI and how it works. 
   
   >This is compound change of many things - including the fact that most of us can focus on contribution to product, because there are people who constantly keep the machinery not only going but also improving and rolling the wheels better (even if is more complex).
   
   This is exactly what we should not try to do mainly because currently "the people" is just you for major issues (or Ash for some) -- other contributors (including me) fix issues here and there but fallback to you.
   
   "I am in favor of introducing these optimizations if we are confident in its stability and reliability, and it is also easy to maintain." -- I still 100% agree with this statement from Kamil. 
   
   >And any time we found we missed one and it caused faillure we can find it and fix it if we find it caused problem (and incrementally get better at catching such cases). We will not get it perfect the first time, but it's fine. We will catch it an fix it, at the same time vast majority of the build will benefit from that.
   
   This is what I don't agree with you about -- there is no need for Master to fail for something which is not strictly required.
   
   
   
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #12983: Add support for the new selective docs building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-761826613


   And I argue it's only the "target" reference that matters. I .e. if  a provider refers a core class but 'core" does not change at all (and core does not refer this provider), it means that only the provider doc/tests should be built.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #12983: Add support for the new selective docs building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-761826284


   I'd say then we need (4). Include .rst cross-references.  
   
   Let's build the dependencies including documentation and generate a minimum set of "packages" that makes sense for it. Looking at your example, this should be as easy as finding all `~airflow.providers` references and we can build such a graph. This means that if for example one commits a change to "singularity" provider which is not referred anywhere else, it should only build docs (and tests) for that provider only.
   
   Does it make sense?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-761826284


   Fine for me: I'd say then we need (4). Build only docs components that come out of both .py and .rst cross-references.  
   
   Let's build the dependencies including documentation and generate a minimum set of "packages" that makes sense for it. Looking at your example, this should be as easy as finding all `~airflow.providers` references and we can build such a graph. This means that if for example one commits a change to "singularity" provider which is not referred anywhere else, it should only build docs (and tests) for that provider only.
   
   Does it make sense? I think this will be much easier to reason about than to hardcode core - + 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-762291754


   What do you mean by "unlimited" ?
   
    Surely we are using limited number of those and we can enumerate those. 
   
   And any time we found we missed one and it caused faillure we can find it and fix it if we find it caused problem (and incrementally get better at catching such cases). We will not get it perfect the first time, but it's fine. We will catch it an fix it, at the same time vast majority of the build will benefit from that. 
   
   Again: Mulitply savings by number of times those builds are run. this is the gain.
   Implementation is one-time cost.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil commented on issue #12983: Add support for the new selective docs building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
kaxil commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-761709707


   I agree with @mik-laj 's comment in https://github.com/apache/airflow/issues/12983#issuecomment-742515225
   
   As I said in https://github.com/apache/airflow/issues/13716#issuecomment-761702451: let's not try to optimize this because we use cross-reference of apache-airflow in some providers and vice-versa, so we should build all the docs.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #12983: Add support for the new selective docs building in CI and Breeze `build-docs` command

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-742499085


   Or maybe even not - I even did not know the --package flag has been added here and I just checked that --help displays them. So Breeze part is done,  only the CI speedup is needed with selective checks.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #12983: Add support for the new selective docs building in CI and Breeze `build-docs` command

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-742491911


   cc: @mik-laj @kaxil @ashb - > good opportunity for one of you to get  a bit more understanding of the CI :). Hapy to help with that.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-762307428


   > This is what I want to avoid. I would like to focus more on contributing to the project or to project documentation, rather than contributing to the contributing tools, unless absolutely necessary. A very small group of people wants to focus on contributing to the tools for contribution.
   
   The only way to get better is to continuously improve your tools, especially when you can improve lives of many people by making improvements.  I think this is extremely important.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-762430800


   Surely this is all subjective. And it's soooo easy to forget about things like werkzeug drama (and a number of other problems). And about alll the issues flooding us when one day suddenly airflow could not be installed because dependencies broken. Or about mypy/flake errors that a number of people could not repreoduce or about failing kubernetes tests because the scripts combined tox, docker image and the K8S setup was failing randomly (which was far more complex than current tests TBH and no-one understood it then anyway). Or about conflicting constraints and totally outdated dependencies and base python images we had for the docker image (which are getting improved)
   
   Most of the complexity of the system we have now is the result of fixing those problems (in the way that prevents them from reoccurring - that's why all those things are constantly running on our CI and double-checking). The complexity is emerging from the complexity of the system we have,  not "reason" on its own.
   
   This is often the case that you fail to see the things that "just work" and forget how much work was put into getting to that "just works" state. It's really, really easy to be overhwelmed by a number of issues like that. If you do not continuously work on fighting such issues (and yest sometimes it means increasing complexity) you might one day wake up in the state that you have *just* enough capacity to fight with the on-going issues. 
   
   It's super easy to underestimate importance of it and put a blind eye on it. I've seen many CI systems that took that slippery road - the problem with it is that once you let it happen it goes down with accelerating pattern. System of that size and that many people contributing people is inevitable to take quite some maintenance time.
   
   I wonder if you've thought about the number changes (when CI works reasonable) that we are able to handle because of all the systems we have in place now rather than "just code". If you look at that chart below, we have hardly more active people working on Airflow as 'active commiters' than 1.5 years ago, yet we are able to handle many more changes:
   
   ![Screenshot from 2021-01-18 20-10-42](https://user-images.githubusercontent.com/595491/104954921-8c2d2700-59c9-11eb-86d4-6e4f494d0bde.png)
   
   This is compound change of many things - including the fact that most of us can focus on contribution to product, because there are people who constantly keep the machinery not only going but also improving and rolling the wheel better.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-761827102


   And I'd say both optimisation (remove other jobs and selective build)  are equally important for me - only when we combine the two we will get real optimization - similar to what we have now when someone changes README.rst or any other "non-doc" folder change.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-761826284


   Fine for me: I'd say then we need (4). Build only docs components that come out of  .rst cross-references.  
   
   Let's build the dependencies including documentation and generate a minimum set of "packages" that makes sense for it. Looking at your example, this should be as easy as finding all `~airflow.providers` references and we can build such a graph. This means that if for example one commits a change to "singularity" provider which is not referred anywhere else, it should only build docs (and tests) for that provider only.
   
   Does it make sense? I think this will be much easier to reason about than to hardcode core - + 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #12983: Add support for the new selective docs building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-761768795


   All such cross references are already automatically checked (via pre-commit) and stored in https://github.com/apache/airflow/blob/master/airflow/providers/dependencies.json . 
   
   I think we should be then very consistent in what we propose as optimizations and base it on actual gains achieved for everyone (not only for committer who push their change). I want to think 'globally' in terms of regular contributor adding their code not only in "let's optimise the speed of changes that I usually do". 
   
   I returned back to the subject after @mik-laj  raised this: #13706
   
   Now I want us to consider what makes more sense:
   
   1) Optimizing #13706 - doc/*.rst documentation-only change
   
   2) Optimizing docs build per-provider - only one provider code+documentation changes
   
   We are optimizing not  for "now" (i.e. 2.0.1 release and 2.0.0 'cleanups') but also for the next  (say) 5 months of Airflow Builds. 
   
   Whatever we invest now in - will get return of that investment in the coming months. 
   
   Let's compare how often those builds happen (1)  vs (2).
   
   For (1) those are builds that are usually "cleanup and refactor kinds" - we have them now a bit more often because we are  cleaning up stuff - restructuring the docs a little, fixing last spell checking issues.  However in "regular" work. I'd argue changing only documentation is a bad smell. Documentation should usually be changed together with the code when code change happens. So I argue that those builds will be extremely rare in the future. And they will mostly not impact the people who are making 'substantial' changes - those who need fast feedback on their builds.
   
   On the other hand, I think we will have a lot (I think vast majority) of "one-provider-only" changes coming from contributors. Those are already quite optimized during the tests and we can (I plan to)  also easily add 'run tests only for this provider and dependent ones (using the `dependencies.json`). 
   
   This will cut down test time to ~ 5 minutes for typical providers. Now, full docs build takes 25 minutes. If we also limit it to only those providers that changed, this docs build will also take 5 minutes for most of the providers (we do not have to build "airflow" docs in this case  - there should be "0" references from airflow to particular providers docs.
   
   This basically means that a contributor will get feedback about 'one-provider' change 20 minutes faster. Multiply it by a number of "one-provider" changes. 
   
   I have no time to pull the 'hard" data - i.e. how many changes we have that are (1) "doc/*.rst" only vs. (2) "single-provider" only. But my intuition tells me that over the next months vast majority of the changes will be of the (2) kind. 
   
   I firmly believe we should optimize those things that have higher impact and  bring more benefits. I hate doing micro-optimisations, I always think "long-term"/"high impact" when I am doing it.
   
   WDYT @kaxil @mik-laj which of those are worth it? (1) or (2)? I argue (above) that (2) has much higher impact and brings more benefits to the community as whole. If you think we should do (1) rather than (2), I'd love to hear your line of thoughts an reasoning why you think it is better to do it.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil commented on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
kaxil commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-762418986


   >This is what I want to avoid. I would like to focus more on contributing to the project or to project documentation, rather than contributing to the contributing tools, unless absolutely necessary. A very small group of people wants to focus on contributing to the tools for contribution.
   
   Strongly agree with this statement from @mik-laj . 
   
   >The only way to get better is to continuously improve your tools, especially when you can improve lives of many people by making improvements. I think this is extremely important.
   
   Agree with that in principle, but the most important part of the contributing tool is "stability". If optimisation comes at the cost of stability and making it less easy to comprehend (our CI system is already getting difficult to grasp at least all of it). 
   
   >And any time we found we missed one and it caused faillure we can find it and fix it if we find it caused problem (and incrementally get better at catching such cases). We will not get it perfect the first time, but it's fine. We will catch it an fix it, at the same time vast majority of the build will benefit from that.
   
   That is causing a significant issues with UX because Master starts failing and we need to tell everyone to rebase to Master. I can understand it when it is absolutely needed or a big optimisation but for the current docs optimisation, while the (3) or (4) we mentioned in the comment would be good to have but not if it comes with a cost.
   
   
   >This is a very subjective word. For some, it is important that the tools are stable, reliable and as simple as possible. For others, it is more important that they are fast, efficient, feature-full, but this can create difficulties with stability and reliability and add more complexity. I am in favor of introducing these optimizations if we are confident in its stability and reliability, and it is also easy to maintain.
   
   Agree with this 100% too -- I would strongly prefer stability than speed. Like Kamil said -- only if it necessary or does not compromise stability otherwise that time can be used to improve the Product itself. The CI system / script should not become a product of it's own. It is there to compliment the product and make it more stable -- which with breeze I think we already have.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #12983: Add support for the new selective docs building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-742520793


   I do not think it is "we must do it now" :). However, doc building takes now 23 (!) minutes on CI, so especially in the case of the "no-tests needed" builds that do not run any of the unit tests, it is definitely the one that stands out.
   
   I think it has very little to do with #12282 to be honest because selective checks are NOT part of Breeze. They are fully separated and they have nothing to do with Breeze at all. The selective checks could have been rewritten in python as part of implementing the 'selective docs'. That might be actually a very good idea. 
   
   And I agree it is a bit dangerous, but so all the selective tests are (however we always have the master builds after merge that run all the tests and docs and this should remain after this change) 
   
   We can make the logic a bit more complex if we rewrite it in python to make some protection against it (for example check very quickly if there are no :doc: or other references to the provider package that has just been changed.
   
   Just saying - it's a great opportunity to both - get some real benefits (in terms of feedback time and CI job strain) and to do it in a 'future-proof' way (i.e. python).
   
   WDYT?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #12983: Add support for the new selective docs building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-761826284


   Fine for me: I'd say then we need (4). Include .rst cross-references.  
   
   Let's build the dependencies including documentation and generate a minimum set of "packages" that makes sense for it. Looking at your example, this should be as easy as finding all `~airflow.providers` references and we can build such a graph. This means that if for example one commits a change to "singularity" provider which is not referred anywhere else, it should only build docs (and tests) for that provider only.
   
   Does it make sense?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #12983: Add support for the new selective docs building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-761768795


   All such cross references are already automatically checked (via pre-commit) and stored in https://github.com/apache/airflow/blob/master/airflow/providers/dependencies.json . 
   
   I think we should be then very consistent in what we propose as optimizations and base it on actual gains achieved for everyone (not only for committer who push their change). I want to think 'globally' in terms of regular contributor adding their code not only in "let's optimise the speed of changes that I usually do". 
   
   I returned back to the subject after @mik-laj  raised this: #13706
   
   Now I want us to consider what makes more sense:
   
   1) Optimizing #13706 - doc/*.rst documentation-only change
   
   2) Optimizing docs build per-provider - only one provider code+documentation changes
   
   We are optimizing not  for "now" (i.e. 2.0.1 release and 2.0.0 'cleanups') but also for the next  (say) 5 months of Airflow Builds. 
   
   Whatever we invest now in - will get return of that investment in the coming months. 
   
   Let's compare how often those builds happen (1)  vs (2).
   
   For (1) those are builds that are usually "cleanup and refactor kinds" - we have them now a bit more often because we are  cleaning up stuff - restructuring the docs a little, fixing last spell checking issues.  However in "regular" work. I'd argue changing only documentation is a bad smell. Documentation should usually be changed together with the code when that code happens. So I argue that those builds will be extremely rare in the future. And they will mostly not impact the people who are making 'substantial' changes - those who need fast feedback on their builds.
   
   On the other hand, I think we will have a lot (I think vast majority) of "one-provider-only" changes coming from contributors. Those are already quite optimized during the tests and we can (I plan to)  also easily add 'run tests only for this provider and dependent ones (using the `dependencies.json`). 
   
   This will cut down test time to ~ 5 minutes for typical providers. Now, full docs build takes 25 minutes. If we also limit it to only those providers that changed, this docs build will also take 5 minutes for most of the providers (we do not have to build "airflow" docs in this case  - there should be "0" references from airflow to particular providers docs.
   
   This basically means that a contributor will get feedback about 'one-provider' change 20 minutes faster. Multiply it by a number of "one-provider" changes. 
   
   I have no time to pull the 'hard" data - i.e. how many changes we have that are (1) "doc/*.rst" only vs. (2) "single-provider" only. But my intuition tells me that over the next months vast majority of the changes will be of the (2) kind. 
   
   I firmly believe we should optimize those things that have higher impact and  bring more benefits. I hate doing micro-optimisations, I always think "long-term"/"high impact" when I am doing it.
   
   WDYT @kaxil @mik-laj which of those are worth it? (1) or (2)? I argue (above) that (2) has much higher impact and brings more benefits to the community as whole. If you think we should do (1) rather than (2), I'd love to hear your line of thoughts an reasoning why you think it is better to do it.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #12983: Add support for the new selective docs building in CI and Breeze `build-docs` command

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-742498194


   We need to have list of packages here most likely. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil edited a comment on issue #12983: Add support for the new selective docs building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
kaxil edited a comment on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-761819429


   >All such cross references are already automatically checked (via pre-commit) and stored in https://github.com/apache/airflow/blob/master/airflow/providers/dependencies.json .
   
   This is not the cross-reference we (at least me) are talking about. Google Providers for example reference xcom documentation from apache-airflow (core) docs. Example: https://github.com/apache/airflow/blob/master/docs/apache-airflow-providers-google/operators/marketing_platform/campaign_manager.rst
   
   And similarly apache-airflow (core) docs reference `~airflow.providers.http.operators.http.SimpleHttpOperator` and `~airflow.providers.sqlite.operators.sqlite.SqliteOperator`, `~airflow.providers.jdbc.hooks.jdbc.JdbcHook`
   
   So at the very least, we need to build docs:
   
   - `apache-airflow`
   - `apache-airflow-providers`
   - `apache-airflow-providers-PROVIDER_THAT_CHANGES`
   
   
   >Now I want us to consider what makes more sense:
   > 1. Optimizing #13706 - doc/*.rst documentation-only change
   > 2. Optimizing docs build per-provider - only one provider code+documentation changes
   
   (2) is not possible because of the reason I explained above. We need (3)  Optimizing docs build **core** + **apache-airflow-providers** + per-provider - only one provider code+documentation changes + **core** docs + **apache-airflow-providers**
   
   
   >I'd argue changing only documentation is a bad smell. Documentation should usually be changed together with the code when code change happens. So I argue that those builds will be extremely rare in the future. And they will mostly not impact the people who are making 'substantial' changes - those who need fast feedback on their builds.
   
   Wrong, we encourage doc only changes for new contributors, where they can add a missing section, fix formatting issues, correct outdated information, definitely not a bad smell. Documentation has to be a first-class citizen and important for us as a project, more so because we are an OSS project. The history of doc only changes (based on a number of commits) is also large. And at least documentation will be always evolving (getting better) -- not only now because we released a major version.
   
   >We do not have to build "airflow" docs in this case - there should be "0" references from airflow to particular providers docs.
   I don't think so, it is completely find for just to reference for example a Slack provider to explain sla_miss_callback functionality or Secrets Backend.
   
   Example references:
   ```
   best-practices.rst:207:Similarly, if you have a task that starts a microservice in Kubernetes or Mesos, you should check if the service has started or not using :class:`airflow.providers.http.sensors.http.HttpSensor`.
   concepts.rst:434:by pre-installing some :doc:`apache-airflow-providers:index` packages (they are always available no
   concepts.rst:437:- :class:`~airflow.providers.http.operators.http.SimpleHttpOperator` - sends an HTTP request
   concepts.rst:438:- :class:`~airflow.providers.sqlite.operators.sqlite.SqliteOperator` - SQLite DB operator
   concepts.rst:443:additional packages manually (for example ``apache-airflow-providers-mysql`` package).
   concepts.rst:447:- :class:`~airflow.providers.mysql.operators.mysql.MySqlOperator`
   concepts.rst:448:- :class:`~airflow.providers.postgres.operators.postgres.PostgresOperator`
   concepts.rst:449:- :class:`~airflow.providers.microsoft.mssql.operators.mssql.MsSqlOperator`
   concepts.rst:450:- :class:`~airflow.providers.oracle.operators.oracle.OracleOperator`
   concepts.rst:451:- :class:`~airflow.providers.jdbc.operators.jdbc.JdbcOperator`
   concepts.rst:452:- :class:`~airflow.providers.docker.operators.docker.DockerOperator`
   concepts.rst:453:- :class:`~airflow.providers.apache.hive.operators.hive.HiveOperator`
   concepts.rst:454:- :class:`~airflow.providers.amazon.aws.operators.s3_file_transform.S3FileTransformOperator`
   concepts.rst:455:- :class:`~airflow.providers.mysql.transfers.presto_to_mysql.PrestoToMySqlOperator`,
   concepts.rst:456:- :class:`~airflow.providers.slack.operators.slack.SlackAPIOperator`
   concepts.rst:459:at :doc:`apache-airflow-providers:index`.
   concepts.rst:795:``conn_id`` for the :class:`~airflow.providers.postgres.hooks.postgres.PostgresHook` is
   howto/connection.rst:332:can also add a custom provider that adds custom connection types. See :doc:`apache-airflow-providers:index`
   howto/connection.rst:339::py:class:`~airflow.providers.jdbc.hooks.jdbc.JdbcHook`.
   howto/connection.rst:350:You can read more about details how to add custom provider packages in the :doc:`apache-airflow-providers:index`
   howto/custom-operator.rst:101:See :doc:`connection` for how to create and manage connections and :doc:`apache-airflow-providers:index` for
   howto/custom-operator.rst:268:is :class:`airflow.providers.google.cloud.sensors.gcs.GCSUploadSessionCompleteSensor`.
   howto/define_extra_link.rst:66:all the operators through an airflow plugin or through airflow providers. You can learn more about it in the
   howto/define_extra_link.rst:67::ref:`plugin example <plugin-example>` and in :doc:`apache-airflow-providers:index`.
   howto/define_extra_link.rst:77:tasks using :class:`~airflow.providers.amazon.aws.transfers.gcs_to_s3.GCSToS3Operator` operator.
   howto/define_extra_link.rst:86:  from airflow.providers.amazon.aws.transfers.gcs_to_s3 import GCSToS3Operator
   howto/define_extra_link.rst:112::class:`~airflow.providers.google.cloud.operators.bigquery.BigQueryExecuteQueryOperator` includes a link to the Google Cloud
   howto/define_extra_link.rst:119:    from airflow.providers.google.cloud.operators.bigquery import BigQueryOperator
   howto/define_extra_link.rst:145:As explained in :doc:`apache-airflow-providers:index`, when you create your own Airflow Provider, you can
   howto/define_extra_link.rst:150:by ``apache-airflow-providers-google`` provider currently:
   howto/define_extra_link.rst:155:      - airflow.providers.google.cloud.operators.bigquery.BigQueryConsoleLink
   howto/define_extra_link.rst:156:      - airflow.providers.google.cloud.operators.bigquery.BigQueryConsoleIndexableLink
   howto/define_extra_link.rst:157:      - airflow.providers.google.cloud.operators.mlengine.AIPlatformConsoleLink
   howto/email-config.rst:76:      email_backend = airflow.providers.sendgrid.utils.emailer.send_email
   installation.rst:88:has a corresponding ``apache-airflow-providers-amazon`` providers package to be installed. When you install
   installation.rst:106:see: :doc:`apache-airflow-providers:index`
   installation.rst:108:For the list of the provider packages and what they enable, see: :doc:`apache-airflow-providers:packages-ref`.
   modules_management.rst:272:    apache-airflow-providers-amazon           | 1.0.0b2
   modules_management.rst:273:    apache-airflow-providers-apache-cassandra | 1.0.0b2
   modules_management.rst:274:    apache-airflow-providers-apache-druid     | 1.0.0b2
   modules_management.rst:275:    apache-airflow-providers-apache-hdfs      | 1.0.0b2
   modules_management.rst:276:    apache-airflow-providers-apache-hive      | 1.0.0b2
   operators-and-hooks-ref.rst:23::doc:`apache-airflow-providers:operators-and-hooks-ref/index`.
   plugins.rst:169:    from airflow.providers.amazon.aws.transfers.gcs_to_s3 import GCSToS3Operator
   production-deployment.rst:762:Some operators, such as :class:`airflow.providers.google.cloud.operators.kubernetes_engine.GKEStartPodOperator`,
   production-deployment.rst:763::class:`airflow.providers.google.cloud.operators.dataflow.DataflowStartSqlJobOperator`, require
   production-deployment.rst:869:If you want to establish an SSH connection to the Compute Engine instance, you must have the network address of this instance and credentials to access it. To simplify this task, you can use :class:`~airflow.providers.google.cloud.hooks.compute.ComputeEngineHook` instead of :class:`~airflow.providers.ssh.hooks.ssh.SSHHook`
   production-deployment.rst:871:The :class:`~airflow.providers.google.cloud.hooks.compute.ComputeEngineHook` support authorization with Google OS Login service. It is an extremely robust way to manage Linux access properly as it stores short-lived ssh keys in the metadata service, offers PAM modules for access and sudo privilege checking and offers nsswitch user lookup into the metadata service as well.
   security/secrets/secrets-backend/index.rst:69:    airflow.providers.google.cloud.secrets.secret_manager.CloudSecretManagerBackend
   upgrading-to-2.rst:117:    from airflow.providers.docker.operators.docker import DockerOperator
   upgrading-to-2.rst:126:automatically installs the ``apache-airflow-providers-docker`` package.
   upgrading-to-2.rst:131:You can read more about providers at :doc:`apache-airflow-providers:index`.
   upgrading-to-2.rst:507:  * You can read more about providers at :doc:`apache-airflow-providers:index`.
   ```
   
   >I firmly believe we should optimize those things that have higher impact and bring more benefits. I hate doing micro-optimisations, I always think "long-term"/"high impact" when I am doing it.
   
   We should not fall in the trap of over-optimisation though. We can still get a good reduction in time (that will result in more benefits and hopefully higher impact) as we won't be building docs for all providers (only `apache-airflow`, `apache-airflow-providers` and `apache-airflow-provder-PROVIDER_THAT_CHANGED`.) 
   
   >WDYT @kaxil @mik-laj which of those are worth it? (1) or (2)? I argue (above) that (2) has much higher impact and brings more benefits to the community as whole. If you think we should do (1) rather than (2), I'd love to hear your line of thoughts an reasoning why you think it is better to do it.
   
   As I said above, I would vouch for (1) and (3) not (2) or just (3) where it will supersede (1) too.
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-762430800


   Surely this is all subjective. And it's soooo easy to forget about things like werkzeug drama (and a number of other similar problems). And about alll the issues flooding us when one day suddenly airflow could not be installed because dependencies broken. Or about mypy/flake errors that a number of people could not reproduce or about failing kubernetes tests because the scripts combined tox, docker image and the K8S setup was failing randomly (which was far more complex than current tests TBH and no-one understood it then anyway). Or about conflicting constraints and totally outdated dependencies and base python images we had for the docker image (which are getting improved)
   
   Most of the complexity of the system we have now is the result of fixing those problems (in the way that prevents them from reoccurring - that's why all those things are constantly running on our CI and double-checking). The complexity is emerging from the complexity of the system we have,  not "reason" on its own.
   
   This is often the case that you fail to see the things that "just work" and forget how much work was put into getting to that "just works" state. It's really, really easy to be overhwelmed by a number of issues like that. If you do not continuously work on fighting such issues (and yest sometimes it means increasing complexity) you might one day wake up in the state that you have *just* enough capacity to fight with the on-going issues. 
   
   It's super easy to underestimate importance of it and put a blind eye on it. I've seen many CI systems that took that slippery road - the problem with it is that once you let it happen it goes down with accelerating speed. System of that size and that many people contributing people is inevitable to take quite some maintenance time.
   
   I wonder if you've thought about the number changes (when CI works reasonable) that we are able to handle because of all the systems we have in place now rather than "just code". If you look at that chart below, we have hardly more active people working on Airflow as 'active commiters' than 1.5 years ago, yet we are able to handle many more changes:
   
   ![Screenshot from 2021-01-18 20-10-42](https://user-images.githubusercontent.com/595491/104954921-8c2d2700-59c9-11eb-86d4-6e4f494d0bde.png)
   
   This is compound change of many things - including the fact that most of us can focus on contribution to product, because there are people who constantly keep the machinery not only going but also improving and rolling the wheel better.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-762430800


   Surely this is all subjective. And it's soooo easy to forget about things like werkzeug drama (and a number of other similar problems). And about alll the issues flooding us when one day suddenly airflow could not be installed because dependencies broken. Or about mypy/flake errors that a number of people could not repreoduce or about failing kubernetes tests because the scripts combined tox, docker image and the K8S setup was failing randomly (which was far more complex than current tests TBH and no-one understood it then anyway). Or about conflicting constraints and totally outdated dependencies and base python images we had for the docker image (which are getting improved)
   
   Most of the complexity of the system we have now is the result of fixing those problems (in the way that prevents them from reoccurring - that's why all those things are constantly running on our CI and double-checking). The complexity is emerging from the complexity of the system we have,  not "reason" on its own.
   
   This is often the case that you fail to see the things that "just work" and forget how much work was put into getting to that "just works" state. It's really, really easy to be overhwelmed by a number of issues like that. If you do not continuously work on fighting such issues (and yest sometimes it means increasing complexity) you might one day wake up in the state that you have *just* enough capacity to fight with the on-going issues. 
   
   It's super easy to underestimate importance of it and put a blind eye on it. I've seen many CI systems that took that slippery road - the problem with it is that once you let it happen it goes down with accelerating pattern. System of that size and that many people contributing people is inevitable to take quite some maintenance time.
   
   I wonder if you've thought about the number changes (when CI works reasonable) that we are able to handle because of all the systems we have in place now rather than "just code". If you look at that chart below, we have hardly more active people working on Airflow as 'active commiters' than 1.5 years ago, yet we are able to handle many more changes:
   
   ![Screenshot from 2021-01-18 20-10-42](https://user-images.githubusercontent.com/595491/104954921-8c2d2700-59c9-11eb-86d4-6e4f494d0bde.png)
   
   This is compound change of many things - including the fact that most of us can focus on contribution to product, because there are people who constantly keep the machinery not only going but also improving and rolling the wheel better.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #12983: Add support for the new selective docs building in CI and Breeze `build-docs` command

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-742497972


   ```
   
    Detailed usage for command: build-docs
   
   
    breeze build-docs [-- <EXTRA_ARGS>]
   
          Builds Airflow documentation. The documentation is build inside docker container - to
          maintain the same build environment for everyone. Appropriate sources are mapped from
          the host to the container so that latest sources are used. The folders where documentation
          is generated ('docs/_build') are also mounted to the container - this way results of
          the documentation build is available in the host.
   
          The possible extra args are: --docs-only, --spellcheck-only, --package, --help
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #12983: Add support for the new selective docs building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-742520793


   I do not think it is "we must do it now" :). However, doc building takes now 23 (!) minutes on CI, so especially in the case of the "no-tests needed" builds that do not run any of the unit tests, it is definitely the one that stands out.
   
   I think it has very little to do with #12282 to be honest because selective checks are NOT part of Breeze. They are fully separated and they have nothing to do with Breeze at all. The selective checks could have been rewritten in python as part of implementing the 'selective docs'. That might be actually a very good idea. 
   
   And I agree it is a bit dangerous, but so all the selective tests are (however we always have the safety-net of master builds after merge that run all the tests and docs and this should remain after this change) 
   
   We can make the logic a bit more complex if we rewrite it in python to make some protection against it (for example check very quickly if there are no :doc: or other references to the provider package that has just been changed.
   
   Just saying - it's a great opportunity to both - get some real benefits (in terms of feedback time and CI job strain) and to do it in a 'future-proof' way (i.e. python).
   
   WDYT?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-762430800


   Surely this is all subjective. And it's soooo easy to forget about things like werkzeug drama (and a number of other similar problems). And about alll the issues flooding us when one day suddenly airflow could not be installed because dependencies broken. Or about mypy/flake errors that a number of people could not reproduce or about failing kubernetes tests because the scripts combined tox, docker image and the K8S setup was failing randomly (which was far more complex than current tests TBH and no-one understood it then anyway). Or about conflicting constraints and totally outdated dependencies and base python images we had for the docker image (which are getting improved)
   
   Most of the complexity of the system we have now is the result of fixing those problems (in the way that prevents them from reoccurring - that's why all those things are constantly running on our CI and double-checking). The complexity is emerging from the complexity of the system we have,  not "reason" on its own.
   
   This is often the case that you fail to see the things that "just work" and forget how much work was put into getting to that "just works" state. It's really, really easy to be overhwelmed by a number of issues like that. If you do not continuously work on fighting such issues (and yest sometimes it means increasing complexity) you might one day wake up in the state that you have *just* enough capacity to fight with the on-going issues. 
   
   It's super easy to underestimate importance of it and put a blind eye on it. I've seen many CI systems that took that slippery road - the problem with it is that once you let it happen it goes down with accelerating pattern. System of that size and that many people contributing people is inevitable to take quite some maintenance time.
   
   I wonder if you've thought about the number changes (when CI works reasonable) that we are able to handle because of all the systems we have in place now rather than "just code". If you look at that chart below, we have hardly more active people working on Airflow as 'active commiters' than 1.5 years ago, yet we are able to handle many more changes:
   
   ![Screenshot from 2021-01-18 20-10-42](https://user-images.githubusercontent.com/595491/104954921-8c2d2700-59c9-11eb-86d4-6e4f494d0bde.png)
   
   This is compound change of many things - including the fact that most of us can focus on contribution to product, because there are people who constantly keep the machinery not only going but also improving and rolling the wheel better.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-761827102


   And I'd say both optimisation (remove other jobs and selective build)  are equally important for me - only when we combine the two we will get real optimization - similar to what we have now when someone changes README.rst or any other "non-doc" folder documentation change.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj edited a comment on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
mik-laj edited a comment on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-762278361


   > I firmly believe we should optimize those things that have higher impact and bring more benefits. I hate doing micro-optimisations, I always think "long-term"/"high impact" when I am doing it.
   
   Any optimization that is not just a syntax change has a cost. Sometimes it is just a matter of dedicating time to implementation and reviews.  However, optimizations are very often associated with higher costs. In this case, it seems to me that we have several factors that we must take into account:
   
    - Hardware resources 
    - User experience
    - Reliability
    - Feedback time
   
   We can modify each of these factors independently but each influences the others.  If we try to optimize the feedback time very aggressively, it may affect its reliability. We will have to analyze and solve problems related to documentation, but we will not focus on the contribution itself.  We don't have too many experts who know the documentation process, so any such problem will be quite a pain to invest both for us and for new contributors who will not be aware that they have to overcome certain limitations of our optimization. 
   
   When it comes to feedback, we have a solution that is not overly burdensome for the contributor and is much faster than our CI. It is just building documentation locally where you can check documentation in less than 1 minute for most packages.  Encouraging the use of building documentation locally will also improve the user experience as the user will be able to quickly preview this change locally, but we need to encourage users to do so first.
   
    >  Looking at your example, this should be as easy as finding all ~airflow.providers references and we can build such a graph. 
    
    This is just one example, but it does not cover all possible ways to reference other packages. For example, the user can also reference using ````:ref:`howto/operator:BigtableCreateInstanceOperator` ```` or ```` :doc:`apache-airflow-providers-google:operators/cloud/bigtable.rst` ````. The number of combinations is almost unlimited, and these references can appear in both .rst and docstring files.  You are only sure of this when all these references are resolved, but this is done by Sphinx.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj commented on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-762278361


   > I firmly believe we should optimize those things that have higher impact and bring more benefits. I hate doing micro-optimisations, I always think "long-term"/"high impact" when I am doing it.
   
   Any optimization that is not just a syntax change has a cost. Sometimes it is just a matter of dedicating time to implementation and reviews.  However, optimizations are very often associated with higher costs. In this case, it seems to me that we have several factors that we must take into account:
   
    - Hardware resources 
    - User experience
    - Reliability
    - Feedback time
   
   We can modify each of these factors independently but each influences the others.  If we try to optimize the feedback time very aggressively, it may affect its reliability. We will have to analyze and solve problems related to documentation, but we will not focus on the contribution itself.  We don't have too many experts who know the documentation process, so any such problem will be quite a pain to invest both for us and for new contributors who will not be aware that they have to overcome certain limitations of our optimization. 
   
   When it comes to feedback, we have a solution that is not overly burdensome for the contributor and is much faster than our CI. It is just building documentation locally where you can check documentation in less than 1 minute for most packages.  Encouraging the use of building documentation locally will also improve the user experience as the user will be able to quickly preview this change locally, but we need to encourage users to do so first.
   
    >  Looking at your example, this should be as easy as finding all ~airflow.providers references and we can build such a graph. 
    
    This is just one example, but it does not cover all possible ways to reference other packages. For example, the user can also reference using :ref:`howto/operator:BigtableCreateInstanceOperator` or :doc:`apache-airflow-providers-google:operators/cloud/bigtable.rst`. The number of combinations is almost unlimited, and these references can appear in both .rst and docstring files.  You are only sure of this when all these references are resolved, but this is done by Sphinx.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] kaxil commented on issue #12983: Add support for the new selective docs building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
kaxil commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-761819429


   >All such cross references are already automatically checked (via pre-commit) and stored in https://github.com/apache/airflow/blob/master/airflow/providers/dependencies.json .
   
   This is not the cross-reference we (at least me) are talking about. Google Providers for example reference xcom documentation from apache-airflow (core) docs. Example: https://github.com/apache/airflow/blob/master/docs/apache-airflow-providers-google/operators/marketing_platform/campaign_manager.rst
   
   And similarly apache-airflow (core) docs reference `~airflow.providers.http.operators.http.SimpleHttpOperator` and `~airflow.providers.sqlite.operators.sqlite.SqliteOperator`, `~airflow.providers.jdbc.hooks.jdbc.JdbcHook`
   
   So at the very least, we need to build docs:
   
   - `apache-airflow`
   - `apache-airflow-providers`
   - `apache-airflow-providers-PROVIDER_THAT_CHANGES`
   
   
   >Now I want us to consider what makes more sense:
   > 1. Optimizing #13706 - doc/*.rst documentation-only change
   > 2. Optimizing docs build per-provider - only one provider code+documentation changes
   
   (2) is not possible because of the reason I explained above. We need (3)  Optimizing docs build **core** + **apache-airflow-providers** + per-provider - only one provider code+documentation changes + **core** docs + **apache-airflow-providers**
   
   
   >I'd argue changing only documentation is a bad smell. Documentation should usually be changed together with the code when code change happens. So I argue that those builds will be extremely rare in the future. And they will mostly not impact the people who are making 'substantial' changes - those who need fast feedback on their builds.
   
   Wrong, we encourage doc only changes for new contributors, where they can add a missing section, fix formatting issues, correct outdated information, definitely not a bad smell. Documentation has to be a first-class citizen and important for us as a project, more so because we are an OSS project. The history of doc only changes (based on a number of commits) is also large. And at least documentation will be always evolving (getting better) -- not only now because we released a major version.
   
   >We do not have to build "airflow" docs in this case - there should be "0" references from airflow to particular providers docs.
   I don't think so, it is completely find for just to reference for example a Slack provider to explain sla_miss_callback functionality or Secrets Backend.
   
   Example references:
   ```
   backport-providers.rst:29:Apache Airflow have been moved to the ``airflow.providers`` package.
   backport-providers.rst:79:`README.md <https://github.com/apache/airflow/blob/master/airflow/providers/google/README.md>`_.
   backport-providers.rst:93:``ModuleNotFoundError: No module named 'airflow.providers'``).
   best-practices.rst:207:Similarly, if you have a task that starts a microservice in Kubernetes or Mesos, you should check if the service has started or not using :class:`airflow.providers.http.sensors.http.HttpSensor`.
   concepts.rst:434:by pre-installing some :doc:`apache-airflow-providers:index` packages (they are always available no
   concepts.rst:437:- :class:`~airflow.providers.http.operators.http.SimpleHttpOperator` - sends an HTTP request
   concepts.rst:438:- :class:`~airflow.providers.sqlite.operators.sqlite.SqliteOperator` - SQLite DB operator
   concepts.rst:443:additional packages manually (for example ``apache-airflow-providers-mysql`` package).
   concepts.rst:447:- :class:`~airflow.providers.mysql.operators.mysql.MySqlOperator`
   concepts.rst:448:- :class:`~airflow.providers.postgres.operators.postgres.PostgresOperator`
   concepts.rst:449:- :class:`~airflow.providers.microsoft.mssql.operators.mssql.MsSqlOperator`
   concepts.rst:450:- :class:`~airflow.providers.oracle.operators.oracle.OracleOperator`
   concepts.rst:451:- :class:`~airflow.providers.jdbc.operators.jdbc.JdbcOperator`
   concepts.rst:452:- :class:`~airflow.providers.docker.operators.docker.DockerOperator`
   concepts.rst:453:- :class:`~airflow.providers.apache.hive.operators.hive.HiveOperator`
   concepts.rst:454:- :class:`~airflow.providers.amazon.aws.operators.s3_file_transform.S3FileTransformOperator`
   concepts.rst:455:- :class:`~airflow.providers.mysql.transfers.presto_to_mysql.PrestoToMySqlOperator`,
   concepts.rst:456:- :class:`~airflow.providers.slack.operators.slack.SlackAPIOperator`
   concepts.rst:459:at :doc:`apache-airflow-providers:index`.
   concepts.rst:795:``conn_id`` for the :class:`~airflow.providers.postgres.hooks.postgres.PostgresHook` is
   howto/connection.rst:332:can also add a custom provider that adds custom connection types. See :doc:`apache-airflow-providers:index`
   howto/connection.rst:339::py:class:`~airflow.providers.jdbc.hooks.jdbc.JdbcHook`.
   howto/connection.rst:350:You can read more about details how to add custom provider packages in the :doc:`apache-airflow-providers:index`
   howto/custom-operator.rst:101:See :doc:`connection` for how to create and manage connections and :doc:`apache-airflow-providers:index` for
   howto/custom-operator.rst:268:is :class:`airflow.providers.google.cloud.sensors.gcs.GCSUploadSessionCompleteSensor`.
   howto/define_extra_link.rst:66:all the operators through an airflow plugin or through airflow providers. You can learn more about it in the
   howto/define_extra_link.rst:67::ref:`plugin example <plugin-example>` and in :doc:`apache-airflow-providers:index`.
   howto/define_extra_link.rst:77:tasks using :class:`~airflow.providers.amazon.aws.transfers.gcs_to_s3.GCSToS3Operator` operator.
   howto/define_extra_link.rst:86:  from airflow.providers.amazon.aws.transfers.gcs_to_s3 import GCSToS3Operator
   howto/define_extra_link.rst:112::class:`~airflow.providers.google.cloud.operators.bigquery.BigQueryExecuteQueryOperator` includes a link to the Google Cloud
   howto/define_extra_link.rst:119:    from airflow.providers.google.cloud.operators.bigquery import BigQueryOperator
   howto/define_extra_link.rst:145:As explained in :doc:`apache-airflow-providers:index`, when you create your own Airflow Provider, you can
   howto/define_extra_link.rst:150:by ``apache-airflow-providers-google`` provider currently:
   howto/define_extra_link.rst:155:      - airflow.providers.google.cloud.operators.bigquery.BigQueryConsoleLink
   howto/define_extra_link.rst:156:      - airflow.providers.google.cloud.operators.bigquery.BigQueryConsoleIndexableLink
   howto/define_extra_link.rst:157:      - airflow.providers.google.cloud.operators.mlengine.AIPlatformConsoleLink
   howto/email-config.rst:76:      email_backend = airflow.providers.sendgrid.utils.emailer.send_email
   installation.rst:88:has a corresponding ``apache-airflow-providers-amazon`` providers package to be installed. When you install
   installation.rst:106:see: :doc:`apache-airflow-providers:index`
   installation.rst:108:For the list of the provider packages and what they enable, see: :doc:`apache-airflow-providers:packages-ref`.
   modules_management.rst:272:    apache-airflow-providers-amazon           | 1.0.0b2
   modules_management.rst:273:    apache-airflow-providers-apache-cassandra | 1.0.0b2
   modules_management.rst:274:    apache-airflow-providers-apache-druid     | 1.0.0b2
   modules_management.rst:275:    apache-airflow-providers-apache-hdfs      | 1.0.0b2
   modules_management.rst:276:    apache-airflow-providers-apache-hive      | 1.0.0b2
   operators-and-hooks-ref.rst:23::doc:`apache-airflow-providers:operators-and-hooks-ref/index`.
   plugins.rst:169:    from airflow.providers.amazon.aws.transfers.gcs_to_s3 import GCSToS3Operator
   production-deployment.rst:762:Some operators, such as :class:`airflow.providers.google.cloud.operators.kubernetes_engine.GKEStartPodOperator`,
   production-deployment.rst:763::class:`airflow.providers.google.cloud.operators.dataflow.DataflowStartSqlJobOperator`, require
   production-deployment.rst:869:If you want to establish an SSH connection to the Compute Engine instance, you must have the network address of this instance and credentials to access it. To simplify this task, you can use :class:`~airflow.providers.google.cloud.hooks.compute.ComputeEngineHook` instead of :class:`~airflow.providers.ssh.hooks.ssh.SSHHook`
   production-deployment.rst:871:The :class:`~airflow.providers.google.cloud.hooks.compute.ComputeEngineHook` support authorization with Google OS Login service. It is an extremely robust way to manage Linux access properly as it stores short-lived ssh keys in the metadata service, offers PAM modules for access and sudo privilege checking and offers nsswitch user lookup into the metadata service as well.
   security/secrets/secrets-backend/index.rst:69:    airflow.providers.google.cloud.secrets.secret_manager.CloudSecretManagerBackend
   upgrading-to-2.rst:117:    from airflow.providers.docker.operators.docker import DockerOperator
   upgrading-to-2.rst:126:automatically installs the ``apache-airflow-providers-docker`` package.
   upgrading-to-2.rst:131:You can read more about providers at :doc:`apache-airflow-providers:index`.
   upgrading-to-2.rst:507:  * You can read more about providers at :doc:`apache-airflow-providers:index`.
   ```
   
   >I firmly believe we should optimize those things that have higher impact and bring more benefits. I hate doing micro-optimisations, I always think "long-term"/"high impact" when I am doing it.
   
   We should not fall in the trap of over-optimisation though. We can still get a good reduction in time (that will result in more benefits and hopefully higher impact) as we won't be building docs for all providers (only `apache-airflow`, `apache-airflow-providers` and `apache-airflow-provder-PROVIDER_THAT_CHANGED`.) 
   
   >WDYT @kaxil @mik-laj which of those are worth it? (1) or (2)? I argue (above) that (2) has much higher impact and brings more benefits to the community as whole. If you think we should do (1) rather than (2), I'd love to hear your line of thoughts an reasoning why you think it is better to do it.
   
   As I said above, I would vouch for (1) and (3) not (2) or just (3) where it will supersede (1) too.
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #12983: Add support for the new selective docs building in CI and Breeze `build-docs` command

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-742499085


   Or maybe even not - I even did not know the --package flag has been added here and I just checked that --help displays all packaages. So Breeze part is done,  only the CI speedup is needed with selective checks.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-817337246


   We can close this one. Introducing parallel building of docs solved this problem in entirely different way.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #12983: Add support for the new selective docs building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-761768795


   All such cross references are already automatically checked (via pre-commit) and stored in https://github.com/apache/airflow/blob/master/airflow/providers/dependencies.json . 
   
   I think we should be then very consistent in what we propose as optimizations and base it on actual gains achieved for everyone (not only for committer who push their change). I want to think 'globally' in terms of regular contributor adding their code not only in "let's optimise the speed of changes that I usually do". 
   
   I returned back to the subject after @mik-laj  raised this: #13706
   
   Now I want us to consider what makes more sense:
   
   1) Optimizing #13706 - doc/*.rst documentation-only change
   
   2) Optimizing docs build per-provider - only one provider code+documentation changes
   
   We are optimizing not  for "now" (i.e. 2.0.1 release and 2.0.0 'cleanups') but also for the next  (say) 5 months of Airflow Builds. 
   
   Whatever we invest now in - will get return of that investment in the coming months. 
   
   Let's compare how often those builds happen (1)  vs (2).
   
   For (1) those are builds that are usually "cleanup and refactor kinds" - we have them now a bit more often because we are  cleaning up stuff - restructuring the docs a little, fixing last spell checking issues.  However in "regular" work. I'd argue changing only documentation is a bad smell. Documentation should usually be changed together with the code when code change happens. So I argue that those builds will be extremely rare in the future. And they will mostly not impact the people who are making 'substantial' changes - those who need fast feedback on their builds.
   
   On the other hand, I think we will have a lot (I think vast majority) of "one-provider-only" changes coming from contributors. Those are already quite optimized during the tests and we can (I plan to)  also easily add 'run tests only for this provider and dependent ones (using the `dependencies.json`). 
   
   This will cut down test time to ~ 5 minutes for typical providers. Now, full docs build takes 25 minutes. If we also limit it to only those providers that changed, this docs build will also take 5 minutes for most of the providers. We do not have to build "airflow" docs in this case  - there should be "0" references from airflow to particular providers docs.
   
   This basically means that a contributor will get feedback about 'one-provider' change 20 minutes faster. Multiply it by a number of "one-provider" changes. 
   
   I have no time to pull the 'hard" data - i.e. how many changes we might have in "regular" 2.* development that are (1) "doc/*.rst" only vs. (2) "single-provider" only. But my intuition tells me that over the next months vast majority of the changes will be of the (2) kind. 
   
   I firmly believe we should optimize those things that have higher impact and  bring more benefits. I hate doing micro-optimisations, I always think "long-term"/"high impact" when I am doing it.
   
   WDYT @kaxil @mik-laj which of those are worth it? (1) or (2)? I argue (above) that (2) has much higher impact and brings more benefits to the community as whole. If you think we should do (1) rather than (2), I'd love to hear your line of thoughts an reasoning why you think it is better to do it.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-762430800


   Surely this is all subjective. And it's soooo easy to forget about things like werkzeug drama (and a number of other similar problems). And about alll the issues flooding us when one day suddenly airflow could not be installed because dependencies broken. Or about mypy/flake errors that a number of people could not reproduce or about failing kubernetes tests because the scripts combined tox, docker image and the K8S setup was failing randomly (which was far more complex than current tests TBH and no-one understood it then anyway). Or about conflicting constraints and totally outdated dependencies and base python images we had for the docker image (which are getting improved)
   
   Most of the complexity of the system we have now is the result of fixing those problems (in the way that prevents them from reoccurring - that's why all those things are constantly running on our CI and double-checking). The complexity is emerging from the complexity of the system we have,  not "reason" on its own.
   
   This is often the case that you fail to see the things that "just work" and forget how much work was put into getting to that "just works" state. It's really, really easy to be overhwelmed by a number of issues like that. If you do not continuously work on fighting such issues (and yest sometimes it means increasing complexity) you might one day wake up in the state that you have *just* enough capacity to fight with the on-going issues. 
   
   It's super easy to underestimate importance of it and put a blind eye on it. I've seen many CI systems that took that slippery road - the problem with it is that once you let it happen it goes down with accelerating speed. System of that size and that many people contributing people is inevitable to take quite some maintenance time.
   
   I wonder if you've thought about the number changes (when CI works reasonable) that we are able to handle because of all the systems we have in place now rather than "just code". If you look at that chart below, we have hardly more active people working on Airflow as 'active commiters' than 1.5 years ago, yet we are able to handle many more changes:
   
   ![Screenshot from 2021-01-18 20-10-42](https://user-images.githubusercontent.com/595491/104954921-8c2d2700-59c9-11eb-86d4-6e4f494d0bde.png)
   
   This is compound change of many things - including the fact that most of us can focus on contribution to product, because there are people who constantly keep the machinery not only going but also improving and rolling the wheels better (even if is more complex).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj edited a comment on issue #12983: Add support selective docs/tests (provider-related) building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
mik-laj edited a comment on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-762319487


   > improve your tools,
   
   This is a very subjective word. For some, it is important that the tools are stable, reliable and as simple as possible. For others, it is more important that they are fast, efficient, feature-full, but this can create difficulties with stability and reliability and add more complexity.  I am in favor of introducing these optimizations if we are confident in its stability and reliability, and it is also easy to maintain.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #12983: Add support for the new selective docs building in CI build-docs step

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #12983:
URL: https://github.com/apache/airflow/issues/12983#issuecomment-761768795


   All such cross references are already automatically checked (via pre-commit) and stored in https://github.com/apache/airflow/blob/master/airflow/providers/dependencies.json . 
   
   I think we should be then very consistent in what we propose as optimizations and base it on actual gains achieved for everyone (not only for committer who push their change). I want to think 'globally' in terms of regular contributor adding their code not only in "let's optimise the speed of changes that I usually do". 
   
   I returned back to the subject after @mik-laj  raised this: #13706
   
   Now I want us to consider what makes more sense:
   
   1) Optimizing #13706 - doc/*.rst documentation-only change
   
   2) Optimizing docs build per-provider - only one provider code+documentation changes
   
   We are optimizing not  for "now" (i.e. 2.0.1 release and 2.0.0 'cleanups') but also for the next  (say) 5 months of Airflow Builds. 
   
   Whatever we invest now in - will get return of that investment in the coming months. 
   
   Let's compare how often those builds happen (1)  vs (2).
   
   For (1) those are builds that are usually "cleanup and refactor kinds" - we have them now a bit more often because we are  cleaning up stuff - restructuring the docs a little, fixing last spell checking issues.  However in "regular" work. I'd argue changing only documentation is a bad smell. Documentation should usually be changed together with the code when code change happens. So I argue that those builds will be extremely rare in the future. And they will mostly not impact the people who are making 'substantial' changes - those who need fast feedback on their builds.
   
   On the other hand, I think we will have a lot (I think vast majority) of "one-provider-only" changes coming from contributors. Those are already quite optimized during the tests and we can (I plan to)  also easily add 'run tests only for this provider and dependent ones (using the `dependencies.json`). 
   
   This will cut down test time to ~ 5 minutes for typical providers. Now, full docs build takes 25 minutes. If we also limit it to only those providers that changed, this docs build will also take 5 minutes for most of the providers. We do not have to build "airflow" docs in this case  - there should be "0" references from airflow to particular providers docs.
   
   This basically means that a contributor will get feedback about 'one-provider' change 20 minutes faster. Multiply it by a number of "one-provider" changes. 
   
   I have no time to pull the 'hard" data - i.e. how many changes we have that are (1) "doc/*.rst" only vs. (2) "single-provider" only. But my intuition tells me that over the next months vast majority of the changes will be of the (2) kind. 
   
   I firmly believe we should optimize those things that have higher impact and  bring more benefits. I hate doing micro-optimisations, I always think "long-term"/"high impact" when I am doing it.
   
   WDYT @kaxil @mik-laj which of those are worth it? (1) or (2)? I argue (above) that (2) has much higher impact and brings more benefits to the community as whole. If you think we should do (1) rather than (2), I'd love to hear your line of thoughts an reasoning why you think it is better to do it.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org