You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@skywalking.apache.org by GitBox <gi...@apache.org> on 2021/08/02 14:53:43 UTC
[GitHub] [skywalking-python] tom-pytel opened a new pull request #138: fix grpc disconnect, SW_AGENT_MAX_BUFFER_SIZE
tom-pytel opened a new pull request #138:
URL: https://github.com/apache/skywalking-python/pull/138
Discovered this issue while using grpc_proxy in nginx, doesn't seem to happen when connecting directly to grpc server. The presence of the `connected()` protocol method could lead to a deadlock where grpc was disconnected and because of this no other calls to it would be made to reconnect. This is because both the heartbeat and creation of `SpanContext()`s instead of `NoopContext()`s (which would lead to a `report()`) depended on `connected()` being True.
* For this reason removed the `connected()` method entirely since it doesn't even apply to the two other protocols - http and kafka.
* Set `__finished.wait(0)` in the report thread instead of blocking for 3 or 30 seconds because this could be detrimental in high throughput scenarios. Now the only report thread blocking that happens is in the report generators of the various protocols waiting on the queue.
* Also wrapped the various threaded protocol calls in try / except in order to prevent unexpected transient exceptions from killing the threads.
* Added SW_AGENT_MAX_BUFFER_SIZE which sets the maximum queue length like in the Node agent.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: notifications-unsubscribe@skywalking.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [skywalking-python] tom-pytel commented on pull request #138: fix grpc disconnect, SW_AGENT_MAX_BUFFER_SIZE
Posted by GitBox <gi...@apache.org>.
tom-pytel commented on pull request #138:
URL: https://github.com/apache/skywalking-python/pull/138#issuecomment-891792572
Screwed up the local test I guess, removed those options since they were a flawed fix for what this PR fixes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: notifications-unsubscribe@skywalking.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [skywalking-python] tom-pytel commented on pull request #138: fix grpc disconnect, SW_AGENT_MAX_BUFFER_SIZE
Posted by GitBox <gi...@apache.org>.
tom-pytel commented on pull request #138:
URL: https://github.com/apache/skywalking-python/pull/138#issuecomment-905485882
> Setting `__finished.wait(0)` makes it infinitely retry to connect without any delay when the OAP is down, this may keep the CPU unnecessarily busy, what do you think?
It should block in agent/protocol/grpc / http / kafka on the queue wait:
```
timeout = config.QUEUE_TIMEOUT - int(time() - start) # type: int
if timeout <= 0: # this is to make sure we exit eventually instead of being fed continuously
return
segment = queue.get(block=block, timeout=timeout) # type: Segment
```
If you want to increase time between connect attempts then increase `QUEUE_TIMEOUT` in config.py from its current value of 1 second.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: notifications-unsubscribe@skywalking.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [skywalking-python] tom-pytel commented on pull request #138: fix grpc disconnect, SW_AGENT_MAX_BUFFER_SIZE
Posted by GitBox <gi...@apache.org>.
tom-pytel commented on pull request #138:
URL: https://github.com/apache/skywalking-python/pull/138#issuecomment-905503408
But you do have a point, if segments are being generated all the time then this block will not keep connection attempts from being made continuously.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: notifications-unsubscribe@skywalking.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [skywalking-python] tom-pytel commented on pull request #138: fix grpc disconnect, SW_AGENT_MAX_BUFFER_SIZE
Posted by GitBox <gi...@apache.org>.
tom-pytel commented on pull request #138:
URL: https://github.com/apache/skywalking-python/pull/138#issuecomment-905502679
> Well, the `block` is `False` by default...
`True` by default.
```py
def report(self, queue: Queue, block: bool = True):
start = time()
def generator():
while True:
try:
timeout = config.QUEUE_TIMEOUT - int(time() - start) # type: int
if timeout <= 0: # this is to make sure we exit eventually instead of being fed continuously
return
segment = queue.get(block=block, timeout=timeout) # type: Segment
except Empty:
return
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: notifications-unsubscribe@skywalking.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [skywalking-python] kezhenxu94 commented on pull request #138: fix grpc disconnect, SW_AGENT_MAX_BUFFER_SIZE
Posted by GitBox <gi...@apache.org>.
kezhenxu94 commented on pull request #138:
URL: https://github.com/apache/skywalking-python/pull/138#issuecomment-905495738
Well, the `block` is `False` by default...
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: notifications-unsubscribe@skywalking.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [skywalking-python] kezhenxu94 commented on pull request #138: fix grpc disconnect, SW_AGENT_MAX_BUFFER_SIZE
Posted by GitBox <gi...@apache.org>.
kezhenxu94 commented on pull request #138:
URL: https://github.com/apache/skywalking-python/pull/138#issuecomment-891782548
@tom-pytel this usually means the agent can't even start up because of some changes, in this PR, I think this is the cause
<img width="1092" alt="Screen Shot 2021-08-03 at 19 50 30" src="https://user-images.githubusercontent.com/15965696/128010466-749f434d-9e5d-4df4-a8a1-8b687d2bef36.png">
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: notifications-unsubscribe@skywalking.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [skywalking-python] tom-pytel commented on pull request #138: fix grpc disconnect, SW_AGENT_MAX_BUFFER_SIZE
Posted by GitBox <gi...@apache.org>.
tom-pytel commented on pull request #138:
URL: https://github.com/apache/skywalking-python/pull/138#issuecomment-891202080
Hey @kezhenxu94, consult request, this is weird, this PR fails every test here but mostly succeeds locally:
```
=========================================================== test session starts ============================================================
platform linux -- Python 3.8.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 -- /home/tom/venv/bin/python3
cachedir: .pytest_cache
rootdir: /home/tom/src/revdebugpy/skywalking-python
collected 35 items
tests/test_ant_matcher.py::TestFastPathMatch::test_match PASSED [ 2%]
tests/test_counter.py::TestAtomicCounter::test_counter PASSED [ 5%]
tests/test_version_check.py::TestVersionCheck::test_operators PASSED [ 8%]
tests/test_version_check.py::TestVersionCheck::test_version_check PASSED [ 11%]
tests/plugin/sw_aiohttp/test_aiohttp.py::TestPlugin::test_plugin[aiohttp==3.7.3] PASSED [ 14%]
tests/plugin/sw_django/test_django.py::TestPlugin::test_plugin[django==2.0] PASSED [ 17%]
tests/plugin/sw_django/test_django.py::TestPlugin::test_plugin[django==2.2] PASSED [ 20%]
tests/plugin/sw_django/test_django.py::TestPlugin::test_plugin[django==3.0] PASSED [ 22%]
tests/plugin/sw_django/test_django.py::TestPlugin::test_plugin[django==3.1] PASSED [ 25%]
tests/plugin/sw_elasticsearch/test_elasticsearch.py::TestPlugin::test_plugin[elasticsearch==7.9.0] PASSED [ 28%]
tests/plugin/sw_flask/test_flask.py::TestPlugin::test_plugin[flask==1.1.2] PASSED [ 31%]
tests/plugin/sw_flask/test_flask.py::TestPlugin::test_plugin[flask==1.0.4] PASSED [ 34%]
tests/plugin/sw_http/test_http.py::TestPlugin::test_plugin PASSED [ 37%]
tests/plugin/sw_http_wsgi/test_http_wsgi.py::TestPlugin::test_plugin[werkzeug==1.0.1] PASSED [ 40%]
tests/plugin/sw_kafka/test_kafka.py::TestPlugin::test_plugin[kafka-python==2.0.1] PASSED [ 42%]
tests/plugin/sw_psycopg2/test_psycopg2.py::TestPlugin::test_plugin[psycopg2-binary==2.8.6] FAILED [ 45%]
tests/plugin/sw_pymongo/test_pymongo.py::TestPlugin::test_plugin[pymongo==3.11.0] PASSED [ 48%]
tests/plugin/sw_pymysql/test_pymysql.py::TestPlugin::test_plugin[PyMySQL==0.10.0] FAILED [ 51%]
tests/plugin/sw_pyramid/test_plugin.py::TestPlugin::test_plugin[pyramid==1.10] PASSED [ 54%]
tests/plugin/sw_pyramid/test_plugin.py::TestPlugin::test_plugin[pyramid==1.9] PASSED [ 57%]
tests/plugin/sw_rabbitmq/test_rabbitmq.py::TestPlugin::test_plugin[pika==1.1.0] PASSED [ 60%]
tests/plugin/sw_redis/test_redis.py::TestPlugin::test_plugin[redis==3.5.3] PASSED [ 62%]
tests/plugin/sw_requests/test_request.py::TestPlugin::test_plugin[requests==2.24.0] PASSED [ 65%]
tests/plugin/sw_requests/test_request.py::TestPlugin::test_plugin[requests==2.20.0] PASSED [ 68%]
tests/plugin/sw_requests/test_request.py::TestPlugin::test_plugin[requests==2.19.0] PASSED [ 71%]
tests/plugin/sw_requests/test_request.py::TestPlugin::test_plugin[requests==2.13.0] PASSED [ 74%]
tests/plugin/sw_requests/test_request.py::TestPlugin::test_plugin[requests==2.9.0] PASSED [ 77%]
tests/plugin/sw_sanic/test_sanic.py::TestPlugin::test_plugin[sanic==20.3.0] PASSED [ 80%]
tests/plugin/sw_sanic/test_sanic.py::TestPlugin::test_plugin[sanic==20.6.0] PASSED [ 82%]
tests/plugin/sw_sanic/test_sanic.py::TestPlugin::test_plugin[sanic==20.9.0] PASSED [ 85%]
tests/plugin/sw_sanic/test_sanic.py::TestPlugin::test_plugin[sanic==20.9.1] PASSED [ 88%]
tests/plugin/sw_tornado/test_tornado.py::TestPlugin::test_plugin[tornado==6.0.4] PASSED [ 91%]
tests/plugin/sw_tornado/test_tornado.py::TestPlugin::test_plugin[tornado==5.1.1] PASSED [ 94%]
tests/plugin/sw_urllib3/test_urllib3.py::TestPlugin::test_plugin[urllib3==1.25.10] PASSED [ 97%]
tests/plugin/sw_urllib3/test_urllib3.py::TestPlugin::test_plugin[urllib3==1.25.9] PASSED [100%]
```
And the two tests that fail locally are for a completely different reason and they have been failing locally for some time even though they succeed here.
Any insight about why all tests fail?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: notifications-unsubscribe@skywalking.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [skywalking-python] tom-pytel merged pull request #138: fix grpc disconnect, SW_AGENT_MAX_BUFFER_SIZE
Posted by GitBox <gi...@apache.org>.
tom-pytel merged pull request #138:
URL: https://github.com/apache/skywalking-python/pull/138
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: notifications-unsubscribe@skywalking.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [skywalking-python] kezhenxu94 commented on pull request #138: fix grpc disconnect, SW_AGENT_MAX_BUFFER_SIZE
Posted by GitBox <gi...@apache.org>.
kezhenxu94 commented on pull request #138:
URL: https://github.com/apache/skywalking-python/pull/138#issuecomment-905481630
Hi @tom-pytel
> Set __finished.wait(0) in the report thread instead of blocking for 3 or 30 seconds because this could be detrimental in high throughput scenarios. Now the only report thread blocking that happens in is the report generators of the various protocols waiting on the queue.
Setting `__finished.wait(0)` makes it infinitely retry to connect without any delay when the OAP is down, this may keep the CPU unnecessarily busy, what do you think?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: notifications-unsubscribe@skywalking.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org