You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by GitBox <gi...@apache.org> on 2021/02/14 19:09:44 UTC

[GitHub] [airflow] potiuk opened a new issue #14228: (Likely) left-overs from previous build are breaking the build in self-hosted runeers

potiuk opened a new issue #14228:
URL: https://github.com/apache/airflow/issues/14228


   I started to experience strange errors when building documentation And I have a reason to believe this is because of left-overs from previous builds on the self hosted machines. For example looks like somewhere in the generated classes there is a get_info class used which normally is only generated (in different directory) during provider packages preparation.
   
   I think this is the manifestation of not "cleaning" everything during the self-hosted build. As discussed before. I think the only way to avoid any such errors is always start from a completely "clean" state when the runner starts building, otherwise it opens up Pandora's box of many similar problems (imagine for example .pyc files for one python version used in another).
   
   I think the only reasonable way to approach it is to what every other CI system does - hard-cleaning of everything, docker cache, sources, log directories - everything just before the new job start. While it might mean it's a waste of time (and cache) as we always have to start from scratch, this avoids multitude of problems and lost developer's time on investigating issues that should not be  on our radar (and that have no easy way to fix them either). Our builds are prepared to quickly restore the cache as needed - they are able to - very quickly usually - bring the CI images from registry (which we use as cache) - in most cases bringing in nessary images is as quick as 1 minute, so caching them locally is not needed. 
   
   Also cleaning help in keeping the environment sane - if we clean-up everything (True Clean State (TM)) before the build, there will be no "growing" logs and other artifacts that might grow as machine is re-used for several jobs.
   
   We need to clean-up everything before the run because there might be many reasons why the job is not cleanly stoped (cancelling job, temporary network failures and the like). And since we have everything in tmpfs it should be as easy as simply removing and recreating tmpfs volume,
   
   Example failed job where I suspect the problem with non-clean state. I run doc build locally and it completed without problems so I suspect the "non-clean" state of the CI machine's job is the root cause.
   
   https://github.com/apache/airflow/pull/14125/checks?check_run_id=1897287904#step:4:16551
   
   
   ```
     Module "airflow.provider_info.schema.j" could not be loaded. Full source will not be available. "error importing 'airflow.provider_info.schema.j' (exception was: ModuleNotFoundError("No module named 'airflow.provider_info'",))"
     reading sources... [ 20%] operators-and-hooks-ref/apache
     
     Traceback (most recent call last):
       File "/usr/local/lib/python3.6/site-packages/sphinx/events.py", line 111, in emit
         results.append(listener.handler(self.app, *args))
       File "/usr/local/lib/python3.6/site-packages/sphinx/ext/viewcode.py", line 155, in env_purge_doc
         for modname, (code, tags, used, refname) in list(modules.items()):
     TypeError: 'bool' object is not iterable
     
     The above exception was the direct cause of the following exception:
     
     Traceback (most recent call last):
       File "/usr/local/lib/python3.6/site-packages/sphinx/cmd/build.py", line 280, in build_main
         app.build(args.force_all, filenames)
       File "/usr/local/lib/python3.6/site-packages/sphinx/application.py", line 352, in build
         self.builder.build_update()
       File "/usr/local/lib/python3.6/site-packages/sphinx/builders/__init__.py", line 298, in build_update
         len(to_build))
       File "/usr/local/lib/python3.6/site-packages/sphinx/builders/__init__.py", line 310, in build
         updated_docnames = set(self.read())
       File "/usr/local/lib/python3.6/site-packages/sphinx/builders/__init__.py", line 417, in read
         self._read_serial(docnames)
       File "/usr/local/lib/python3.6/site-packages/sphinx/builders/__init__.py", line 436, in _read_serial
         self.events.emit('env-purge-doc', self.env, docname)
       File "/usr/local/lib/python3.6/site-packages/sphinx/events.py", line 120, in emit
         (listener.handler, name), exc, modname=modname) from exc
     sphinx.errors.ExtensionError: Handler <function env_purge_doc at 0x7fa233d74378> for event 'env-purge-doc' threw an exception (exception: 'bool' object is not iterable)
     
     Extension error (sphinx.ext.viewcode):
     Handler <function env_purge_doc at 0x7fa233d74378> for event 'env-purge-doc' threw an exception (exception: 'bool' object is not iterable)
   ```
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on issue #14228: (Likely) left-overs from previous build are breaking the build in self-hosted runeers

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #14228:
URL: https://github.com/apache/airflow/issues/14228#issuecomment-779196607


   @potiuk It's possible this is the cause (and I am doing much cleaning already, just not total nuke and refresh) but in this case I think it is very unlikely to be the problem.
   
   > Similar error here: https://github.com/apache/airflow/pull/14227/checks?check_run_id=1898772208#step:4:16551
   
   In this particular case, the runner two jobs in total:
   
   **CloudWatch Logs Insights**  
   region: eu-central-1  
   log-group-names: GitHubRunners  
   start-time: -86400s  
   end-time: 0s  
   query-string:
   ```
   fields @timestamp, message
   | filter @logStream = 'ip-172-31-25-117'
   | filter message like 'Running job'
   | sort @timestamp desc
   ```
   -----------------------------------------------------------------------------------------------
   |       @timestamp        |                              message                              |
   |-------------------------|-------------------------------------------------------------------|
   | 2021-02-14 19:05:39.617 | 2021-02-14 19:05:39Z: Running job: Build docs                     |
   | 2021-02-14 19:05:39.000 | WRITE LINE: 2021-02-14 19:05:39Z: Running job: Build docs         |
   | 2021-02-14 18:53:07.845 | 2021-02-14 18:53:07Z: Running job: Wait for CI images             |
   | 2021-02-14 18:53:07.000 | WRITE LINE: 2021-02-14 18:53:07Z: Running job: Wait for CI images |
   -----------------------------------------------------------------------------------------------
   
   Since the only previous job was Wait for CI images" I find it hard to believe it would cause 
   
   
   https://github.com/apache/airflow/commit/79ed11f9147483c4b2f6cc99d5a3e41e89d92968
   
   
   https://github.com/sphinx-doc/sphinx/issues/8880
   
   Not a problem with the runner.
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #14228: (Likely) left-overs from previous build are breaking the build in self-hosted runeers

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #14228:
URL: https://github.com/apache/airflow/issues/14228#issuecomment-779200194


   > (I'm glad I took the time to upload all logs from the runner hosts to Cloudwatch now. https://vector.dev/ is much more powerful than Amazon's Cloudwatch agent.)
   
   Absolutely! the more logs, the better and good interface is important for those!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on issue #14228: (Likely) left-overs from previous build are breaking the build in self-hosted runeers

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #14228:
URL: https://github.com/apache/airflow/issues/14228#issuecomment-779202692


   > Glad we have the self-hosted runners now - bigger chance to get it all under control
   
   Yeah, master seems to be "better"/heading in the right direction now at least.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb edited a comment on issue #14228: (Likely) left-overs from previous build are breaking the build in self-hosted runeers

Posted by GitBox <gi...@apache.org>.
ashb edited a comment on issue #14228:
URL: https://github.com/apache/airflow/issues/14228#issuecomment-779196607


   @potiuk It's possible this is the cause (and I am doing much cleaning already, just not total nuke and refresh) but in this case I think it is very unlikely to be the problem.
   
   > Similar error here: https://github.com/apache/airflow/pull/14227/checks?check_run_id=1898772208#step:4:16551
   
   In this particular case, the runner two jobs in total:
   
   **CloudWatch Logs Insights**  
   region: eu-central-1  
   log-group-names: GitHubRunners  
   start-time: -86400s  
   end-time: 0s  
   query-string:
   ```
   fields @timestamp, message
   | filter @logStream = 'ip-172-31-25-117'
   | filter message like 'Running job'
   | sort @timestamp desc
   ```
   -----------------------------------------------------------------------------------------------
   |       @timestamp        |                              message                              |
   |-------------------------|-------------------------------------------------------------------|
   | 2021-02-14 19:05:39.617 | 2021-02-14 19:05:39Z: Running job: Build docs                     |
   | 2021-02-14 19:05:39.000 | WRITE LINE: 2021-02-14 19:05:39Z: Running job: Build docs         |
   | 2021-02-14 18:53:07.845 | 2021-02-14 18:53:07Z: Running job: Wait for CI images             |
   | 2021-02-14 18:53:07.000 | WRITE LINE: 2021-02-14 18:53:07Z: Running job: Wait for CI images |
   -----------------------------------------------------------------------------------------------
   
   Since the only previous job was Wait for CI images" I find it hard to believe it would cause this behaviour.
   
   A google of the error turns up https://github.com/sphinx-doc/sphinx/issues/8880
   
   Which we switched to yesterday https://github.com/apache/airflow/commit/79ed11f9147483c4b2f6cc99d5a3e41e89d92968
   
   Not a problem with the runner.
   
   Please don't leap to conclusions.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #14228: (Likely) left-overs from previous build are breaking the build in self-hosted runeers

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #14228:
URL: https://github.com/apache/airflow/issues/14228#issuecomment-779198942


   Ah cool. Yeah. looks  like it. Thanks for that. I will fix master constraints now - needs also upper limit in sphinx in setup.py


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb closed issue #14228: (Likely) left-overs from previous build are breaking the build in self-hosted runeers

Posted by GitBox <gi...@apache.org>.
ashb closed issue #14228:
URL: https://github.com/apache/airflow/issues/14228


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk edited a comment on issue #14228: (Likely) left-overs from previous build are breaking the build in self-hosted runeers

Posted by GitBox <gi...@apache.org>.
potiuk edited a comment on issue #14228:
URL: https://github.com/apache/airflow/issues/14228#issuecomment-779201314


   BTW. once we get non-failing master (which was a bit neglected due to the mayhem with CI) those kind of updates will only happen automatically after all tests and build passes in master, so it would not have happened :)
   
   Glad we have the self-hosted runners now  - bigger chance to get it all under control


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #14228: (Likely) left-overs from previous build are breaking the build in self-hosted runeers

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #14228:
URL: https://github.com/apache/airflow/issues/14228#issuecomment-778845560


   Similar error here: https://github.com/apache/airflow/pull/14227/checks?check_run_id=1898772208#step:4:16551


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #14228: (Likely) left-overs from previous build are breaking the build in self-hosted runeers

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #14228:
URL: https://github.com/apache/airflow/issues/14228#issuecomment-779201314


   BTW. once we get non-failing master (which was a bit neglected due to the mayhem with CI) those kind of updates will only happen automatically after all tests and build passes in master, so it would not have happened :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on issue #14228: (Likely) left-overs from previous build are breaking the build in self-hosted runeers

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #14228:
URL: https://github.com/apache/airflow/issues/14228#issuecomment-779199651


   (I'm glad I took the time to upload all logs from the runner hosts to Cloudwatch now. https://vector.dev/ is much more powerful than Amazon's Cloudwatch agent.)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] ashb commented on issue #14228: (Likely) left-overs from previous build are breaking the build in self-hosted runeers

Posted by GitBox <gi...@apache.org>.
ashb commented on issue #14228:
URL: https://github.com/apache/airflow/issues/14228#issuecomment-779189706


   Looking.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #14228: (Likely) left-overs from previous build are breaking the build in self-hosted runeers

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #14228:
URL: https://github.com/apache/airflow/issues/14228#issuecomment-779206777


   Setup.py limit : https://github.com/apache/airflow/pull/14238


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [airflow] potiuk commented on issue #14228: (Likely) left-overs from previous build are breaking the build in self-hosted runeers

Posted by GitBox <gi...@apache.org>.
potiuk commented on issue #14228:
URL: https://github.com/apache/airflow/issues/14228#issuecomment-779195798


   If anything I can help to analyse some of the self-hosted problems - later in the evening more likely. So let me know on slack if there are any unresolved issues/investigation points I can help with - happy to help with those.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org