Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/03/24 18:15:56 UTC

[GitHub] [airflow] potiuk opened a new issue #14989: Make Docs builds fallback in case external docs sources are missing

potiuk opened a new issue #14989:
URL: https://github.com/apache/airflow/issues/14989


   Every now and then our docs builds start to fail because of an external dependency (latest example here #14985). And while we are now caching that information, the cache does not help when the initial retrieval fails. This information does not change often, but with the number of dependencies we have, builds will continue to fail regularly, simply because many of those dependencies are not very reliable - they are often just a web page hosted somewhere. They are nowhere near the stability of even PyPI or Apt sources, and we have no mirroring in case of problems.
   
   Maybe we could 
   
   a) see if we can use some kind of mirroring scheme (do those sites have mirrors?)
   b) if not, write a simple script that will dump the cached content for those to S3, refresh it in the scheduled (nightly) CI master builds, and have a fallback mechanism to download it from there in case of any problems in CI?
   
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] uranusjr commented on issue #14989: Make Docs builds fallback in case external docs sources are missing

uranusjr commented on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-807590926


   I think I’ve mostly figured it out. Since we are already downloading the inventory locally ourselves into `docs/_inventory_cache`, we don’t really need to use the tuple form mentioned above. Instead, we can leave `conf.py` unchanged and change `fetch_inventories.py` to check the mirror if the primary source fails during `_inventory_cache` population.
   
   As mentioned above, the mirror inventory can be populated periodically, so a scheduled job will be added to `ci.yml` to fetch the inventories and upload them to S3.
   
   Does that sound like a plan?
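The fallback described here could look roughly like this in `fetch_inventories.py` - a sketch only; the mirror URL, helper name, and `cache_dir` layout are hypothetical, not the actual implementation:

```python
import urllib.request
from pathlib import Path

CACHE_DIR = Path("docs/_inventory_cache")
# Hypothetical mirror location -- the real bucket URL would be decided separately.
MIRROR_PREFIX = "https://example-mirror.s3.amazonaws.com/inventories"

def fetch_inventory(name: str, primary_url: str, cache_dir: Path = CACHE_DIR) -> Path:
    """Download one objects.inv into the cache, trying the mirror if the primary fails."""
    target = cache_dir / name / "objects.inv"
    target.parent.mkdir(parents=True, exist_ok=True)
    for url in (primary_url, f"{MIRROR_PREFIX}/{name}/objects.inv"):
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                target.write_bytes(response.read())
            return target
        except OSError:
            continue  # primary source failed -- fall through to the mirror
    raise RuntimeError(f"could not fetch inventory for {name!r} from primary or mirror")
```

The key property is that the mirror is only consulted when the primary download fails, so normal builds keep fetching the freshest upstream inventories.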





[GitHub] [airflow] mik-laj edited a comment on issue #14989: Make Docs builds fallback in case external docs sources are missing

mik-laj edited a comment on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-808479837


   We should make sure that the documentation for our packages is also built with the cache from the S3 bucket. In some cases, the GitHub Actions cache may be empty, which will make our build take a very long time, and it is very likely that we will hit a timeout. We can skip retrieving inventories from the bucket if we can restore the files from the GitHub Actions cache.






[GitHub] [airflow] potiuk edited a comment on issue #14989: Make Docs builds fallback in case external docs sources are missing

potiuk edited a comment on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-808438815


   > Using GitHub Action’s cache sounds like a cool idea. The only concern I have is the cache can be pretty difficult to purge if something goes wrong (maybe just me not knowing a good way to do it). Maybe we should have a “cache busting” part to the key (e.g. cache-inventory-${{ hashFiles('constraints-3.8.txt') }}-${{ cacheKey }} where cacheKey is a variable in the Action’s config) so we can force a cache miss when we need to.
   
   In this case we will simply change the cache prefix. We've done that several times with other caches. We can always turn it into:
   
   ```
   key: cache-inventory-v2-${{ hashFiles('constraints-3.8.txt') }}
   restore-key: cache-inventory-v2-
   ```
   
   by changing the .yml file. No need for a special variable for that - it can simply be changed in ci.yml. And I think this cache will be self-healing. As I understand it, the inventory mechanism we use now simply downloads the URL and only if that fails reads it from the cache - @mik-laj might confirm it maybe?
   
   > I just thought of another thing, does Action’s cache set file creation date etc. correctly? Because if it does not, we won’t be able to fetch in new documentation updates (say the `1.4` version of a documentation releases a fix without changing the URL) since the files we retrieve from the cache will always be newer.
   
   Highly doubt it. Even if they do, this is not a big problem for us. First of all, the cache will always be the secondary choice - so even if we have a file from the cache, the "URL one" will take precedence (or at least this is how it works). And secondly, the cache will become "eventually consistent". Every time we change constraints (which should happen regularly - every day or so), the cache will get refreshed: while we will have it as the secondary choice, the main one (the URL pull) will still get the latest version and override it, and then a new version of the cache will be produced.





[GitHub] [airflow] uranusjr commented on issue #14989: Make Docs builds fallback in case external docs sources are missing

uranusjr commented on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-808280458


   I just thought of another thing, does Action’s cache set file creation date etc. correctly? Because if it does not, we won’t be able to fetch in new documentation updates (say the `1.4` version of a documentation releases a fix without changing the URL) since the files we retrieve from the cache will always be newer.





[GitHub] [airflow] uranusjr commented on issue #14989: Make Docs builds fallback in case external docs sources are missing

uranusjr commented on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-811042552


   The main issue with pulling versions from the constraints file to format the URLs is not the fetching part, but how the same versions can be passed to Sphinx, since the URLs are also needed in `conf.py`. So if we use versioned URLs, `conf.py` would also need to grow some kind of mechanism to correctly interpret the versions of the dependencies, which seems awkward to me.
   
   A better approach IMO might be to hard-code those versions directly in `third_party_inventories.py` instead, and have some kind of automated process to update that file together with the constraints files. This also means we don’t need to use the constraints file as the GitHub Actions cache key, but can use `third_party_inventories.py` instead.
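A minimal sketch of what such a hard-coded mapping could look like - the package names, versions, and the readthedocs-style URL layout below are purely illustrative, not the contents of the real file:

```python
# Illustrative only: the real third_party_inventories.py would list the actual
# packages, and each project's docs URL layout would need checking by hand.
THIRD_PARTY_VERSIONS = {
    "requests": "2.25.1",
    "celery": "4.4.7",
}

def versioned_inventory_url(package: str) -> str:
    """Build a version-pinned objects.inv URL (readthedocs-style layout assumed)."""
    version = THIRD_PARTY_VERSIONS[package]
    return f"https://{package}.readthedocs.io/en/{version}/objects.inv"
```

An automated process could then rewrite `THIRD_PARTY_VERSIONS` whenever the constraints files change, keeping the two in sync.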





[GitHub] [airflow] uranusjr commented on issue #14989: Make Docs builds fallback in case external docs sources are missing

uranusjr commented on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-810888525


   Sorry, I was too vague in the previous comment. In the solution outlined by @potiuk, we would need to download the third-party inventories when the GHA cache is empty (which you said may happen). If any of the downloads fail (which would not fail the build), we’d end up with an incomplete local inventory cache. What should happen in this situation?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] mik-laj edited a comment on issue #14989: Make Docs builds fallback in case external docs sources are missing

mik-laj edited a comment on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-808448643


   > As I understand it the inventory we are using now simply downloads URL and only if it fails, it reads it from the cache @mik-laj might confirm it maybe?
   
   Not exactly. Before building, we download all inventories to the cache. If it is a third-party library, we download the inventory from the project documentation. If it is our own package, then we download it from the S3 bucket - http://s.apache.org/airflow-docs
   If documentation for a package has already been built, the locally built documentation inventory is used instead of the cache, i.e. we prefer newer inventories.
   
   Only the first package to be built should use cached data alone. There should never be new inventories fetched from the internet during a build.
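The resolution order described above could be sketched like this (the directory layout and helper name are hypothetical, for illustration only):

```python
from pathlib import Path

def resolve_inventory(package: str, build_dir: Path, cache_dir: Path) -> Path:
    """Prefer a freshly built local inventory over the pre-fetched cache."""
    local = build_dir / package / "objects.inv"
    if local.exists():
        return local  # this package's docs were just built -- newest data wins
    return cache_dir / package / "objects.inv"  # fall back to the pre-fetched cache
```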





[GitHub] [airflow] mik-laj commented on issue #14989: Make Docs builds fallback in case external docs sources are missing

mik-laj commented on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-810854805


   Currently, the cache (with our inventories) is never empty. We always have access to the archival version of the documentation. If GitHub Actions fails to retrieve some third-party inventory, the build will fail. This is what we want to fix in this ticket.





[GitHub] [airflow] uranusjr commented on issue #14989: Make Docs builds fallback in case external docs sources are missing

uranusjr commented on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-806507648


   Mirroring sounds like a good idea. What should happen if the nightly S3 bucket update fails?






[GitHub] [airflow] potiuk edited a comment on issue #14989: Make Docs builds fallback in case external docs sources are missing

potiuk edited a comment on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-808194362


   Looks great, @uranusjr!
   
   I have another thought in the meantime. Why not use the caching of GitHub Actions: https://docs.github.com/en/actions/guides/caching-dependencies-to-speed-up-workflows - the inventory files are not big (2.3 MB):
   
   ```
   [jarek:~/code/airflow/docs] master+ ± du -h --summarize _inventory_cache
   2.3M	_inventory_cache
   ```
   
   So we could simply cache it and use the already existing inventory cache mechanism this way (we are already using `_inventory_cache` as a fallback). That will probably be way simpler than setting up S3 and managing it periodically.
   
   Also, we could improve it to automatically add the "exact" version of each library we use to each inventory URL, which would make it much better - currently we take "latest" or "stable", but in fact this is wrong: we should look at our constraints file and use the exact version of each library, which should be rather simple to retrieve. That would also make caching much more efficient - we could invalidate the cache whenever any of the libraries changes and get it rebuilt. And this works really nicely if we use the `restore-keys:` part of the caching already present in GitHub Actions.
   
   Some pointers: 
   
   * we could take current library versions from: https://raw.githubusercontent.com/apache/airflow/constraints-master/constraints-3.8.txt
   * we could simply download the constraints file and use it to calculate cache validity:
             with this key: cache-inventory-${{ hashFiles('constraints-3.8.txt') }}
             and this restore-keys: cache-inventory-
             the restore-keys work in such a way that if the hash of the constraints file changes, the latest available cache matching the prefix is downloaded and used as a base (and after the cache is rebuilt, a new version with the changed hash is uploaded for the next job to use)
   * and we could also use the constraints to derive the right versions when building the inventory URLs to download from
       
   WDYT? Happy to help and brainstorm on it, but this would be much simpler operationally (no separate S3 bucket, no access management, etc.) and more "correct" in terms of the documentation generated.
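Extracting the exact versions from a constraints file is indeed straightforward - a sketch, assuming standard `name==version` pins (the helper name is made up for illustration):

```python
import re

def parse_constraints(text: str) -> dict:
    """Extract pinned versions from a pip constraints file ('name==version' lines)."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        match = re.match(r"([A-Za-z0-9._-]+)==(\S+)", line)
        if match:
            versions[match.group(1).lower()] = match.group(2)
    return versions
```

The resulting mapping could then be used both to build versioned inventory URLs and (via its hash) as part of the cache key.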
       
       
      






[GitHub] [airflow] kaxil closed issue #14989: Make Docs builds fallback in case external docs sources are missing

kaxil closed issue #14989:
URL: https://github.com/apache/airflow/issues/14989


   








[GitHub] [airflow] ashb commented on issue #14989: Make Docs builds fallback in case external docs sources are missing

ashb commented on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-808123164


   I think the build job in CI should always use the S3 cache - it is much more predictable and under our control that way.
   
   And if we have the mirror as a separate workflow, then it can be triggered manually if we want to update it sooner than the nightly job.







[GitHub] [airflow] potiuk commented on issue #14989: Make Docs builds fallback in case external docs sources are missing

potiuk commented on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-808440168


   > We’ll probably need to check if each third party has versioned documentation as well (not a given). And if one doesn’t, just default to latest?
   
   Yeah. I think this is part of the solution, but I guess it will be easy - just run the inventory pull, see where it fails, and make an exception list.





[GitHub] [airflow] potiuk commented on issue #14989: Make Docs builds fallback in case external docs sources are missing

potiuk commented on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-808470146


   > If it is a third-party library, we download the inventory from the project documentation
   
   So I guess the solution is to:
   
   1) Always restore the cache from GitHub's cache, with the `restore-keys:` fallback for when the hash of the constraints file has changed
   2) Try to download from the third-party sources into that cache (and if we can't, just skip the download rather than fail the build)
   3) Build the docs from the cache
   
   Then, if the fallback cache was chosen, a new cache will be created from the newly updated one.
   
   We can even filter the constraints file down to only the 'relevant' libraries and base our hash on that filtered file. Then the cache will only get updated when we move to newer versions of the relevant libraries.
   
   That looks pretty sound to me @uranusjr @mik-laj - WDYT?
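Step 2 above - refreshing the cache but skipping failed downloads instead of failing the build - could be sketched like this (function name and cache layout are hypothetical):

```python
import urllib.request
from pathlib import Path

def refresh_inventories(urls: dict, cache_dir: Path) -> list:
    """Try to refresh every cached inventory; for any download that fails,
    keep the (possibly stale) cached copy instead of failing the build."""
    skipped = []
    for name, url in urls.items():
        target = cache_dir / name / "objects.inv"
        target.parent.mkdir(parents=True, exist_ok=True)
        try:
            with urllib.request.urlopen(url, timeout=30) as response:
                target.write_bytes(response.read())
        except OSError:
            skipped.append(name)  # download failed -- the cached copy stays as-is
    return skipped
```

Returning the list of skipped packages makes it easy to log which inventories are stale without turning a flaky upstream site into a red build.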
   
   











[GitHub] [airflow] uranusjr commented on issue #14989: Make Docs builds fallback in case external docs sources are missing

uranusjr commented on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-807804417


   First stab in #15024.





[GitHub] [airflow] potiuk commented on issue #14989: Make Docs builds fallback in case external docs sources are missing

potiuk commented on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-806284267


   Looks like our Sphinx can handle it - mirroring is a feature of intersphinx mapping:
   
   ```
   New in 1.3.
   
   Alternative files can be specified for each inventory. One can give a tuple for the second inventory tuple item as shown in the following example. This will read the inventory iterating through the (second) tuple items until the first successful fetch. The primary use case for this to specify mirror sites for server downtime of the primary inventory:
   
   intersphinx_mapping = {'python': ('https://docs.python.org/3',
                                     (None, 'python-inv.txt'))}
   ```
   
   I think it is just a matter of:
   
   a) automatically fetching the inventories from time to time
   b) putting them on our S3 bucket with "web" property
   c) adding automated mirroring from the S3 bucket.
   d) we can even automatically update those S3 bucket inventories in our nightly scheduled build.
   
   Should be easy to do and can save us a LOT of hassle.
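
   A hedged sketch of what such a `conf.py` mapping could look like (the S3 bucket URL here is purely hypothetical, not the actual Airflow configuration):

```python
# Hypothetical conf.py fragment. For each target, the second element is a
# tuple of inventory locations that Sphinx tries in order: None means the
# default objects.inv at the primary URL, followed by an assumed S3 mirror.
S3_MIRROR = "https://airflow-doc-inventories.s3.amazonaws.com"  # hypothetical bucket

intersphinx_mapping = {
    "python": (
        "https://docs.python.org/3",
        (None, f"{S3_MIRROR}/python/objects.inv"),
    ),
    "requests": (
        "https://docs.python-requests.org/en/master",
        (None, f"{S3_MIRROR}/requests/objects.inv"),
    ),
}
```

   With this shape, Sphinx falls back to the mirror only when the primary `objects.inv` fetch fails.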
   





[GitHub] [airflow] kaxil commented on issue #14989: Make Docs builds fallback in case external docs sources are missing

Posted by GitBox <gi...@apache.org>.
kaxil commented on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-806699696


   > Mirroring sounds like a good idea. What should happen if the nightly S3 bucket update fails?
   
   I think we somehow need to ignore the inventories if possible or building those providers if inventories are absolutely required. Would you like to take a stab at this @uranusjr :) ?





[GitHub] [airflow] mik-laj commented on issue #14989: Make Docs builds fallback in case external docs sources are missing

Posted by GitBox <gi...@apache.org>.
mik-laj commented on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-808479837


   We should make sure that the documentation for our packages is also built with the cache from the S3 bucket. In some cases, the GitHub Actions cache may be empty, which would make our build run very long, and it is very likely that we would hit a timeout. But we can skip retrieving inventories from the bucket if we can restore the files from the GitHub Actions cache.





[GitHub] [airflow] mik-laj edited a comment on issue #14989: Make Docs builds fallback in case external docs sources are missing

Posted by GitBox <gi...@apache.org>.
mik-laj edited a comment on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-808448643


   > As I understand it the inventory we are using now simply downloads URL and only if it fails, it reads it from the cache @mik-laj might confirm it maybe?
   
   Not exactly. Before building, we download all inventories to the cache. If it is a third-party library, we download the inventory from the project documentation. If it is our own package, then we download it from the S3 bucket - http://s.apache.org/airflow-docs
   If documentation for a package has already been built, the locally built inventory is used instead of the cache, i.e. we prefer the newer inventory.
   
   By default, all builds should use the cache. 





[GitHub] [airflow] potiuk edited a comment on issue #14989: Make Docs builds fallback in case external docs sources are missing

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-808438815


   > Using GitHub Action’s cache sounds like a cool idea. The only concern I have is the cache can be pretty difficult to purge if something goes wrong (maybe just me not knowing a good way to do it). Maybe we should have a “cache busting” part to the key (e.g. cache-inventory-${{ hashFiles('constrant-3.8.txt') }}-${{ cacheKey }} where cacheKey is a variable in the Action’s config) so we can force a cache miss when we need to.
   
   In this case we will simply change the cache prefix. We've done that several times with other caches. We can always turn it into:
   
   ```
   key: cache-inventory-v2-${{ hashFiles('constraints-3.8.txt') }}
   restore-key: cache-inventory-v2-
   ```
   
   by changing the .yml file. No need for a special variable for that; it can simply be changed in ci.yml. And I think this cache will be self-healing. As I understand it, the inventory mechanism we use now simply downloads the URL and reads from the cache only if that fails - @mik-laj might confirm it maybe?
   
   > I just thought of another thing, does Action’s cache set file creation date etc. correctly? Because if it does not, we won’t be able to fetch in new documentation updates (say the `1.4` version of a documentation releases a fix without changing the URL) since the files we retrieve from the cache will always be newer.
   
   Highly doubt it. Even if they do, this is not a big problem for us. First of all, the cache will always be the secondary choice - so even if we have a copy in the cache, the "URL one" will take precedence (or at least I think this is how it works). And secondly, the cache will become "eventually consistent": every time we change constraints (which should happen regularly - every day or so) the cache will get refreshed, because while the cached copy remains the secondary choice, the primary source (the URL pull) will still fetch the latest version and override it, producing a new version of the cache.





[GitHub] [airflow] uranusjr commented on issue #14989: Make Docs builds fallback in case external docs sources are missing

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-808267669


   Using GitHub Action’s cache sounds like a cool idea. The only concern I have is the cache can be pretty difficult to purge if something goes wrong (maybe just me not knowing a good way to do it). Maybe we should have a “cache busting” part to the key (e.g. `cache-inventory-${{ hashFiles('constrant-3.8.txt') }}-${{ cacheKey }}` where `cacheKey` is a variable in the Action’s config) so we can force a cache miss when we need to.
   
   Generally constraints (and requirements) file syntax is considered pip internals, but the files in this case are entirely managed by Airflow anyway, so that’s not really a concern.
   
   We’ll probably need to check if each third-party project has versioned documentation as well (not a given). And if one doesn’t, just default to `latest`?





[GitHub] [airflow] potiuk commented on issue #14989: Make Docs builds fallback in case external docs sources are missing

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-808438815


   > Using GitHub Action’s cache sounds like a cool idea. The only concern I have is the cache can be pretty difficult to purge if something goes wrong (maybe just me not knowing a good way to do it). Maybe we should have a “cache busting” part to the key (e.g. cache-inventory-${{ hashFiles('constrant-3.8.txt') }}-${{ cacheKey }} where cacheKey is a variable in the Action’s config) so we can force a cache miss when we need to.
   
   In this case we will simply change the cache prefix. We've done that several times with other caches. We can always turn it into:
   
   ```
   key: cache-inventory-v2-${{ hashFiles('constraints-3.8.txt') }}
   restore-key: cache-inventory-v2-
   ```
   
   by changing the .yml file. No need for a special variable for that; it can simply be changed in ci.yml. And I think this cache will be self-healing. As I understand it, the inventory mechanism we use now simply downloads the URL and reads from the cache only if that fails - @mik-laj might confirm it maybe?
   
   > I just thought of another thing, does Action’s cache set file creation date etc. correctly? Because if it does not, we won’t be able to fetch in new documentation updates (say the `1.4` version of a documentation releases a fix without changing the URL) since the files we retrieve from the cache will always be newer.
   
   Highly doubt it. Even if they do, this is not a big problem for us. First of all, the cache will always be the secondary choice - so even if we have a copy in the cache, the "URL one" will take precedence (or at least this is how it works). And secondly, the cache will become "eventually consistent": every time we change constraints (which should happen regularly - every day or so) the cache will get refreshed, because while the cached copy remains the secondary choice, the primary source (the URL pull) will still fetch the latest version and override it, producing a new version of the cache.





[GitHub] [airflow] uranusjr commented on issue #14989: Make Docs builds fallback in case external docs sources are missing

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-811025877


   Second go: #15109





[GitHub] [airflow] uranusjr commented on issue #14989: Make Docs builds fallback in case external docs sources are missing

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-806755052


   Sure! Could you provide some pointers where I should start looking for the code/configuration for the nightly builds?





[GitHub] [airflow] uranusjr commented on issue #14989: Make Docs builds fallback in case external docs sources are missing

Posted by GitBox <gi...@apache.org>.
uranusjr commented on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-810834878


   What happens if the cache is empty, *and* the download fails? Would this simply fail the doc build?





[GitHub] [airflow] potiuk edited a comment on issue #14989: Make Docs builds fallback in case external docs sources are missing

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-806284267


   Looks like our Sphinx can handle it - mirroring seems to be a feature of intersphinx mapping:
   
   ```
   New in 1.3.
   
   Alternative files can be specified for each inventory. One can give a tuple for the second inventory tuple item as shown in the following example. This will read the inventory iterating through the (second) tuple items until the first successful fetch. The primary use case for this to specify mirror sites for server downtime of the primary inventory:
   
   intersphinx_mapping = {'python': ('https://docs.python.org/3',
                                     (None, 'python-inv.txt'))}
   ```
   
   I think it is just a matter of:
   
   a) automatically fetching the inventories from time to time
   b) putting them on our S3 bucket with "web" property
   c) adding automated mirroring from the S3 bucket.
   d) we can even automatically update those S3 bucket inventories in our nightly scheduled build.
   
   Should be easy to do and can save us a LOT of hassle.
   





[GitHub] [airflow] mik-laj edited a comment on issue #14989: Make Docs builds fallback in case external docs sources are missing

Posted by GitBox <gi...@apache.org>.
mik-laj edited a comment on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-808448643


   > As I understand it the inventory we are using now simply downloads URL and only if it fails, it reads it from the cache @mik-laj might confirm it maybe?
   
   Not exactly. Before building, we download all inventories to the cache. If it is a third-party library, we download the inventory from the project documentation. If it is our own package, then we download it from the S3 bucket - http://s.apache.org/airflow-docs
   If documentation for a package has already been built, the locally built inventory is used instead of the cache, i.e. we prefer the newer inventories.
   
   The first package to be built uses only cached data. New inventories should never be fetched from the internet during a build.





[GitHub] [airflow] mik-laj edited a comment on issue #14989: Make Docs builds fallback in case external docs sources are missing

Posted by GitBox <gi...@apache.org>.
mik-laj edited a comment on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-808448643


   > As I understand it the inventory we are using now simply downloads URL and only if it fails, it reads it from the cache @mik-laj might confirm it maybe?
   
   Not exactly. Before building, we download all inventories to the cache. If it is a third-party library, we download the inventory from the project documentation. If it is our own package, then we download it from the S3 bucket - http://s.apache.org/airflow-docs
   If documentation for a package has already been built, the locally built inventory is used instead of the cache, i.e. we prefer the newer inventory.
   
   The first package to be built should only use cached data.





[GitHub] [airflow] uranusjr edited a comment on issue #14989: Make Docs builds fallback in case external docs sources are missing

Posted by GitBox <gi...@apache.org>.
uranusjr edited a comment on issue #14989:
URL: https://github.com/apache/airflow/issues/14989#issuecomment-807590926


   I think I’ve it mostly figured out. Since we are already downloading the inventory locally ourselves into `docs/_inventory_cache`, we don’t really need to use the tuple form mentioned above. Instead, we can leave `conf.py` unchanged, but instead change `fetch_inventories.py` to check the mirror if the primary source fails during `_inventory_cache` population.
   
   As mentioned above, the mirror inventories can be populated periodically. So a job will be added to `ci.yml` to fetch the inventories and upload them to S3.
   
   Does that sound like a plan? I’ll try to come up with something.
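
   A rough sketch of what the fallback in `fetch_inventories.py` could look like (the mirror URL, function names, and overall shape are assumptions for illustration, not the actual implementation):

```python
import urllib.request
from urllib.error import URLError
from typing import Optional

# Hypothetical mirror location; the real bucket URL would come from CI config.
S3_MIRROR = "https://airflow-doc-inventories.s3.amazonaws.com"

def candidate_sources(project: str, primary_url: str) -> list:
    # Primary documentation site first, S3 mirror as the fallback.
    return [
        f"{primary_url.rstrip('/')}/objects.inv",
        f"{S3_MIRROR}/{project}/objects.inv",
    ]

def fetch_inventory(project: str, primary_url: str, dest: str) -> Optional[str]:
    """Populate one _inventory_cache entry, trying each source in order."""
    for source in candidate_sources(project, primary_url):
        try:
            with urllib.request.urlopen(source, timeout=30) as resp:
                data = resp.read()
        except (URLError, OSError):
            continue  # primary failed; fall through to the mirror
        with open(dest, "wb") as f:
            f.write(data)
        return source  # which source actually succeeded
    return None  # both primary and mirror failed
```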

