Posted to notifications@skywalking.apache.org by "Superskyyy (via GitHub)" <gi...@apache.org> on 2023/02/24 18:40:22 UTC
[GitHub] [skywalking] Superskyyy opened a new issue, #10447: [Feature] Python agent performance enhancement with Confluent-Kafka
Superskyyy opened a new issue, #10447:
URL: https://github.com/apache/skywalking/issues/10447
### Search before asking
- [X] I had searched in the [issues](https://github.com/apache/skywalking/issues?q=is%3Aissue) and found no similar feature requirement.
### Description
After some investigation, I believe a drop-in replacement of kafka-python with confluent-kafka-python is likely to greatly increase `protocol/kafka` producer throughput.
I believe the problem with the current kafka-python mainly comes from two things: it does not require installing the hardware-accelerated https://pypi.python.org/pypi/crc32c package, and the library itself blocks on many write operations.
If the improvement is significant, I will ship it directly in the next release, 1.1.0, without waiting for the bigger asyncio refactoring task.
### Use case
Kafka-based messages are transmitted much more efficiently.
### Related issues
_No response_
### Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: notifications-unsubscribe@skywalking.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [skywalking] Superskyyy commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka
Posted by "Superskyyy (via GitHub)" <gi...@apache.org>.
Superskyyy commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1444246023
Whoever ends up taking this issue should provide benchmark results comparing throughput before and after the change, and show the impact on service endpoint latency.
[GitHub] [skywalking] FAWC438 commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka
Posted by "FAWC438 (via GitHub)" <gi...@apache.org>.
FAWC438 commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1454578387
> @FAWC438 Btw, this may be a good reason for us to actually try migrating to a separate reporter process instead of reporting inside the user's application process, which can fork unexpectedly (asyncio will be quite painful to implement). We would also get a further performance boost easily. This could be achieved during your potential GSoC task; are you interested?
That's great! I'm interested.
[GitHub] [skywalking] Superskyyy commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka
Posted by "Superskyyy (via GitHub)" <gi...@apache.org>.
Superskyyy commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1445173563
> I will try to handle this.
Great, once you make progress, please don't hesitate to sync here, as this one is in rather high demand from users.
[GitHub] [skywalking] Superskyyy commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka
Posted by "Superskyyy (via GitHub)" <gi...@apache.org>.
Superskyyy commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1447007787
> I have now basically finished migrating the code from **kafka-python** to **confluent-kafka-python** and have successfully run the prefork use case in the demo folder. No further benchmarking has been done yet.
>
> The most obvious problem is that the **sw_confluent_kafka plugin is completely incompatible** after the change, and it must be manually disabled for all programs to run correctly.
>
> The reason is that I'm using the latest version of confluent-kafka-python, 2.0.2, while the plugin only supports up to version 1.8.2, which causes unknown errors in several of the main APIs. Given that confluent-kafka-python is based on librdkafka, these bugs seem difficult to fix.
>
> If it is acceptable to use version 2.0.* of confluent-kafka-python, the sw_confluent_kafka plugin should be disabled or refactored.
>
> Speaking of benchmarking, should I test against all supported versions of Python? Also, how exactly should the test results be presented? Is it OK to use charts? (I don't know how to use the testing tools here yet.)
Great to see the progress! The plugin-core incompatibility is perfectly normal, and the general rule is that we don't force users onto specific versions of core reporter dependencies. (Users should be able to choose whatever confluent-kafka version they want; don't pin it.)
As for the confluent-kafka plugin not working on 2.x, it's totally okay to leave it for now. (Skip the tests by changing the support matrix to Python 3.7: [].) We can fix them later. Be careful, though: there is something you need to change inside the confluent-kafka plugin. Refer to the kafka-python plugin; you need to ignore SkyWalking's own topics, otherwise tracing becomes an endless recursion.
On the other hand, benchmarking with Python 3.10 alone will be fine. I'm just curious how much impact this change brings. You can use wrk (an open-source tool) to test the throughput of the demo FastAPI application when Kafka is used. Another good tool for finding performance issues in a multithreaded environment is Yappi (Yet Another Python Profiler).
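The topic filter mentioned above can be sketched roughly as follows. This is only an illustration, not the plugin's real code: the topic names are assumed values, and `should_trace`, `traced_produce`, and the `spans` list are hypothetical stand-ins for the actual instrumentation and tracing context.

```python
# Hypothetical topic names used only for illustration; the real list should
# come from the agent's Kafka reporter configuration.
OWN_TOPICS = frozenset({
    "skywalking-segments",
    "skywalking-meters",
    "skywalking-logs",
    "skywalking-managements",
})


def should_trace(topic: str, own_topics=OWN_TOPICS) -> bool:
    """Return False for topics the agent itself reports to."""
    return topic not in own_topics


def traced_produce(produce, topic, value, spans):
    """Wrap a produce call; `spans` stands in for the real tracing context."""
    if should_trace(topic):
        spans.append(f"Kafka/{topic}/Producer")
    return produce(topic, value)


sent, spans = [], []
send = lambda topic, value: sent.append((topic, value))
traced_produce(send, "orders", b"payload", spans)
traced_produce(send, "skywalking-segments", b"segment", spans)
print(spans)  # ['Kafka/orders/Producer']
```

Both messages are still sent; only the user's topic produces a span, so reporting a segment never triggers another segment.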
[GitHub] [skywalking] FAWC438 commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka
Posted by "FAWC438 (via GitHub)" <gi...@apache.org>.
FAWC438 commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1454577325
> Great news! It's great to see the perf improvement, that's quite exciting.
>
> Can you provide a minimal working example so I can have a look sometime? I think the prefork model shouldn't break if we don't start the agent before forking (the current sw-python --prefork flag actually attempts to start the agent in the master process for gunicorn and then listens for fork(), yet the librdkafka/producer connection may not survive such a fork, so it ends up as a segmentation fault)
OK, I ran the [fastapi-gunicorn use case](https://github.com/apache/skywalking-python/blob/master/demo/gunicorn_consumer_prefork.py) with the following command:
```bash
nohup sw-python -d run -p \
gunicorn gunicorn_consumer_prefork:app \
--workers 2 --worker-class uvicorn.workers.UvicornWorker \
--log-level debug \
--bind 0.0.0.0:8086 > <log file path>
```
This gave me a log of the process from program start, through the successful creation of all workers, until the keyboard interrupt signal was received:
[log file](https://drive.google.com/file/d/1rpwBgHx0ASzNRbWQQxVUXbhEgRkXbCp9/view)
The log shows that about 30 seconds passed between program start at 2023-03-04 14:00:29 and 2023-03-04 14:01:03, when all worker processes had been created successfully. That is also the [default timeout value](https://docs.gunicorn.org/en/stable/settings.html#timeout) of gunicorn.
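As a quick check of that interval, the two timestamps quoted from the log can be subtracted directly. This is just worked arithmetic on the figures above; the variable names are illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
GUNICORN_DEFAULT_TIMEOUT = 30  # seconds, per the gunicorn settings documentation

start = datetime.strptime("2023-03-04 14:00:29", FMT)
ready = datetime.strptime("2023-03-04 14:01:03", FMT)

elapsed = (ready - start).total_seconds()
print(elapsed)  # 34.0
```

The measured 34 seconds is slightly longer than the 30-second default timeout, which would be consistent with workers being restarted once after timing out.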
Also, if you still want to try running it yourself (I haven't submitted a PR yet), you can run my modified code directly:
```bash
git clone -b dev https://github.com/FAWC438/skywalking-python.git
```
[GitHub] [skywalking] FAWC438 commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka
Posted by "FAWC438 (via GitHub)" <gi...@apache.org>.
FAWC438 commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1453048524
Now I have some good news and some bad news.
The good news is that the performance improvement from confluent_kafka is substantial. Here are some results from my benchmark tests using [wrk](https://github.com/wg/wrk).
For the [fastapi-gunicorn use case](https://github.com/apache/skywalking-python/blob/master/demo/gunicorn_consumer_prefork.py) in the demo folder, I ran several benchmarks in the same environment; the results below are representative. You can see that confluent-kafka-python delivers **more than twice the throughput** of kafka-python.
_**kafka-python**_
```bash
[root@localhost wrk]# ./wrk -t36 -c4000 -d30s --latency http://10.3.242.223:8087/cat
Running 30s test @ http://10.3.242.223:8087/cat
  36 threads and 4000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.41s   336.86ms   1.87s    72.58%
    Req/Sec    44.25     68.59   570.00     89.08%
  Latency Distribution
     50%    1.58s
     75%    1.62s
     90%    1.86s
     99%    1.86s
  16667 requests in 30.05s, 2.65MB read
  Socket errors: connect 0, read 5861, write 0, timeout 15241
Requests/sec:    554.71
Transfer/sec:     90.47KB
[root@localhost wrk]#
```
_**confluent-kafka-python**_
```bash
[root@localhost wrk]# ./wrk -t36 -c4000 -d30s --latency http://10.3.242.223:8086/cat
Running 30s test @ http://10.3.242.223:8086/cat
  36 threads and 4000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.20s   426.34ms   1.98s    60.69%
    Req/Sec    98.17    145.79     0.89k    87.40%
  Latency Distribution
     50%    1.16s
     75%    1.56s
     90%    1.79s
     99%    1.92s
  39687 requests in 30.05s, 6.32MB read
  Socket errors: connect 0, read 3444, write 0, timeout 19562
Requests/sec:   1320.81
Transfer/sec:    215.41KB
[root@localhost wrk]#
```
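The ratio between the two runs can be worked out from the `Requests/sec` lines above. The helper below is a small sketch for pulling that figure out of wrk's text output; `requests_per_sec` is a name made up for this example, not part of wrk.

```python
import re


def requests_per_sec(wrk_output: str) -> float:
    """Extract the Requests/sec figure from wrk's summary text."""
    match = re.search(r"Requests/sec:\s*([\d.]+)", wrk_output)
    if match is None:
        raise ValueError("no Requests/sec line found")
    return float(match.group(1))


kafka_python = requests_per_sec("Requests/sec: 554.71")
confluent = requests_per_sec("Requests/sec: 1320.81")

speedup = confluent / kafka_python
print(f"{speedup:.2f}x")  # 2.38x
```

Note that both runs also report a large number of socket timeouts (15241 and 19562), so the throughput figures were measured under heavy overload.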
Although Kafka is **still a performance bottleneck**, an improvement this large is already significant for the program.
```bash
Clock type: CPU
Ordered by: totaltime, desc
name ncall tsub ttot tavg
..py:321 SkyWalkingAgent.__heartbeat 2 0.000036 0.005567 0.002783
..afka.py:45 KafkaProtocol.heartbeat 2 0.000056 0.005531 0.002766
..ceManagementClient.send_heart_beat 2 0.000191 0.005465 0.002733
..ging/__init__.py:1424 Logger.debug 3 0.000058 0.005001 0.001667
..gging/__init__.py:1565 Logger._log 3 0.000089 0.004931 0.001644
..__init__.py:1550 Logger.makeRecord 3 0.000032 0.003934 0.001311
..__init__.py:282 LogRecord.__init__ 3 0.000249 0.003902 0.001301
..nt/kafka.py:132 heartbeat_callback 1 0.000006 0.003774 0.003774
..hon3.9/abc.py:96 __instancecheck__ 1 0.000017 0.003381 0.003381
..on3.9/abc.py:100 __subclasscheck__ 134/1 0.000170 0.003345 0.000025
...
Clock type: CPU
Ordered by: totaltime, desc
name ncall tsub ttot tavg
..335 SkyWalkingAgent.__report_meter 2 0.000032 0.001472 0.000736
...py:141 KafkaProtocol.report_meter 1 0.000069 0.001415 0.001415
..KafkaMeterDataReportService.report 1 0.000111 0.000730 0.000730
..ging/__init__.py:1424 Logger.debug 1 0.000035 0.000615 0.000615
..gging/__init__.py:1565 Logger._log 1 0.000052 0.000575 0.000575
..nt/protocol/kafka.py:144 generator 10 0.000067 0.000519 0.000052
..b/python3.9/queue.py:154 Queue.get 10 0.000105 0.000344 0.000034
..sw_logging.py:40 Logger._sw_handle 1 0.000010 0.000286 0.000286
..ing/__init__.py:1591 Logger.handle 1 0.000012 0.000277 0.000277
...9/threading.py:280 Condition.wait 3 0.000070 0.000275 0.000092
...
```
[GitHub] [skywalking] Superskyyy commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka
Posted by "Superskyyy (via GitHub)" <gi...@apache.org>.
Superskyyy commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1454490190
@FAWC438 Btw, this may be a good reason for us to actually try migrating to a separate reporter process instead of reporting inside the user's application process, which can fork unexpectedly. We would also get a further performance boost easily. This could be achieved during your potential GSoC task; are you interested?
[GitHub] [skywalking] wu-sheng closed issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka
Posted by "wu-sheng (via GitHub)" <gi...@apache.org>.
wu-sheng closed issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka
URL: https://github.com/apache/skywalking/issues/10447
[GitHub] [skywalking] FAWC438 commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka
Posted by "FAWC438 (via GitHub)" <gi...@apache.org>.
FAWC438 commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1446800146
I have now basically finished migrating the code from **kafka-python** to **confluent-kafka-python** and have successfully run the prefork use case in the demo folder. No further benchmarking has been done yet.
The most obvious problem is that the **sw_confluent_kafka plugin is completely incompatible** after the change, and it must be manually disabled for all programs to run correctly.
The reason is that I'm using the latest version of confluent-kafka-python, 2.0.2, while the plugin only supports up to version 1.8.2, which causes unknown errors in several of the main APIs. Given that confluent-kafka-python is based on librdkafka, these bugs seem difficult to fix.
If it is acceptable to use version 2.0.* of confluent-kafka-python, the sw_confluent_kafka plugin should be disabled or refactored.
Speaking of benchmarking, should I test against all supported versions of Python? Also, how exactly should the test results be presented? Is it OK to use charts? (I don't know how to use the testing tools here yet.)
[GitHub] [skywalking] FAWC438 commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka
Posted by "FAWC438 (via GitHub)" <gi...@apache.org>.
FAWC438 commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1445012325
I will try to handle this.
[GitHub] [skywalking] Superskyyy commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka
Posted by "Superskyyy (via GitHub)" <gi...@apache.org>.
Superskyyy commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1454486475
Can you provide a minimal working example so I can have a look sometime? I think the prefork model shouldn't break if we don't start the agent before forking (the current sw-python --prefork flag actually attempts to start the agent for gunicorn, yet the librdkafka/producer connection may not survive such a fork).
[GitHub] [skywalking] FAWC438 commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka
Posted by "FAWC438 (via GitHub)" <gi...@apache.org>.
FAWC438 commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1453065348
HOWEVER, there is also some seriously bad news.
First of all, uWSGI doesn't work at all. According to the backtrace, this is caused by a segmentation fault raised in the C code of librdkafka.
```bash
!!! uWSGI process 161797 got Segmentation Fault !!!
*** backtrace of 161797 ***
uwsgi(uwsgi_backtrace+0x2e) [0x48ee1e]
uwsgi(uwsgi_segfault+0x21) [0x48f1b1]
/lib64/libc.so.6(+0x36400) [0x7f78e300e400]
/lib64/libc.so.6(+0x13efd6) [0x7f78e3116fd6]
/lib64/libcrypto.so.10(+0x1265b9) [0x7f78e3b315b9]
/lib64/libcrypto.so.10(lh_insert+0x50) [0x7f78e3b318a0]
/lib64/libcrypto.so.10(OBJ_NAME_add+0x6f) [0x7f78e3a7c5ff]
...
uwsgi(python_call+0xf) [0x4a69cf]
uwsgi(uwsgi_python_post_fork+0x86) [0x4a59c6]
uwsgi(uwsgi_run+0x21e) [0x493fce]
uwsgi() [0x43870e]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f78e2ffa555]
uwsgi() [0x4431d5]
*** end of backtrace ***
```
So I decided to disable confluent-kafka-python completely for uWSGI.
Second, Gunicorn is also problematic. Launching a Gunicorn program is very slow, taking more than 30 seconds, and it consistently fails the GitHub CI test. I found the following in the debug log:
```bash
2023-03-03 14:45:56,197 skywalking [pid:164712] [MainThread] [INFO] New process detected, re-initializing SkyWalking Python agent
2023-03-03 14:45:56,198 skywalking [pid:164712] [MainThread] [DEBUG] heartbeat response: <cimpl.Message object at 0x7fab344e8240>
2023-03-03 14:45:56,200 skywalking [pid:164712] [MainThread] [DEBUG] Confluent-Kafka local queue cleaned
2023-03-03 14:45:56,201 skywalking [pid:164712] [MainThread] [DEBUG] Started meter service
2023-03-03 14:45:56,202 skywalking [pid:164712] [MainThread] [DEBUG] Kafka reporter configs: {'bootstrap.servers': '127.0.0.1:9094'}
...
[2023-03-03 14:45:58 +0800] [164661] [WARNING] Worker with pid 164712 was terminated due to signal 11
```
So the problem is that every time Gunicorn creates a worker process for the first time, the worker fails, either because of signal 11 (segmentation fault) or because it times out. Surprisingly, once Gunicorn automatically restarts these workers, they run normally, which I think is why Gunicorn starts so slowly.
I've tried changing Gunicorn's worker class (gevent, eventlet, or even sync) and its timeout parameter (from 0 to 300), but to no avail; sometimes it can't even create a worker process at all.
[GitHub] [skywalking] Superskyyy commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka
Posted by "Superskyyy (via GitHub)" <gi...@apache.org>.
Superskyyy commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1482089228
https://github.com/confluentinc/confluent-kafka-python/issues/351#issuecomment-379270066 confirms that the Kafka client cannot survive a fork.
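One common way to live with that constraint is to create the client lazily in each process and never reuse a handle inherited from the parent. The sketch below shows the idea with a stub factory; `PerProcessClient` is a hypothetical helper, not part of the SkyWalking agent or confluent-kafka-python. In real use the factory would presumably build something like a `confluent_kafka.Producer`, which is an assumption here.

```python
import os


class PerProcessClient:
    """Create the wrapped client lazily and re-create it when the PID changes.

    Not thread-safe as written; callers that share it across threads
    would need to add their own locking.
    """

    def __init__(self, factory):
        self._factory = factory
        self._pid = None
        self._client = None

    def get(self):
        pid = os.getpid()
        if self._client is None or self._pid != pid:
            # First use, or we are in a forked child: build a fresh client
            # instead of touching the handle copied from the parent.
            self._client = self._factory()
            self._pid = pid
        return self._client


created = []


def make_stub_client():
    # Stand-in for a real producer; records which process created it.
    created.append(os.getpid())
    return object()


holder = PerProcessClient(make_stub_client)
first = holder.get()
second = holder.get()
print(first is second, len(created))  # True 1
```

With this shape, nothing is connected in the master process before the fork, and each worker builds its own client on first use.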