Posted to notifications@skywalking.apache.org by "Superskyyy (via GitHub)" <gi...@apache.org> on 2023/02/24 18:40:22 UTC

[GitHub] [skywalking] Superskyyy opened a new issue, #10447: [Feature] Python agent performance enhancement with Confluent-Kafka

Superskyyy opened a new issue, #10447:
URL: https://github.com/apache/skywalking/issues/10447

   ### Search before asking
   
   - [X] I had searched in the [issues](https://github.com/apache/skywalking/issues?q=is%3Aissue) and found no similar feature requirement.
   
   
   ### Description
   
   After investigation, I believe a drop-in replacement (from kafka-python to confluent-kafka-python) is likely to greatly increase `protocol/kafka` producer throughput.
   
   I believe the problem with the current kafka-python mainly comes from two things: it does not require installing the hardware-accelerated https://pypi.python.org/pypi/crc32c package, and the library itself blocks on many write operations.
   
   If the improvement is significant, I will include it directly in the next release, 1.1.0, without waiting for the larger asyncio refactoring task.
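
As a rough illustration of the non-blocking pattern this proposal is after, the sketch below enqueues messages with `produce()` and serves delivery callbacks with `poll(0)`, which is the usual way to drive `confluent_kafka.Producer`. The `send_all` helper, the topic name, and the `RecordingProducer` stand-in (used so the sketch runs without a broker) are assumptions for illustration, not the agent's actual code.

```python
from typing import Callable, Iterable, List, Optional, Tuple


class RecordingProducer:
    """Hypothetical stand-in with the produce/poll/flush methods that
    confluent_kafka.Producer also provides, so the sketch runs without a broker."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, bytes]] = []
        self._pending: List[Callable[[], None]] = []

    def produce(self, topic: str, value: bytes,
                on_delivery: Optional[Callable] = None) -> None:
        # Enqueue only; the callback fires later from poll() or flush().
        self.sent.append((topic, value))
        if on_delivery is not None:
            self._pending.append(lambda: on_delivery(None, value))

    def poll(self, timeout: float = 0) -> int:
        served = len(self._pending)
        for callback in self._pending:
            callback()
        self._pending.clear()
        return served

    def flush(self, timeout: float = -1) -> int:
        self.poll(0)
        return 0  # number of messages still waiting in the queue


def send_all(producer, topic: str, payloads: Iterable[bytes]) -> int:
    """Enqueue every payload without blocking on the network; return the acked count."""
    acked = 0

    def on_delivery(err, _msg) -> None:
        nonlocal acked
        if err is None:
            acked += 1

    for payload in payloads:
        try:
            producer.produce(topic, value=payload, on_delivery=on_delivery)
        except BufferError:
            # Local queue is full: serve callbacks to make room, then retry once.
            producer.poll(1)
            producer.produce(topic, value=payload, on_delivery=on_delivery)
        producer.poll(0)  # serve delivery callbacks without waiting
    producer.flush(5)  # wait up to 5 seconds for outstanding deliveries
    return acked


demo = RecordingProducer()
acked_count = send_all(demo, "skywalking-segments", [b"a", b"b", b"c"])
```

With the real library, `demo` would be replaced by something like `confluent_kafka.Producer({'bootstrap.servers': '127.0.0.1:9094'})`; whether the agent's reporter should be structured this way is a separate question.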
   
   ### Use case
   
   Kafka-based messages are transmitted much more efficiently.
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@skywalking.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [skywalking] Superskyyy commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka

Posted by "Superskyyy (via GitHub)" <gi...@apache.org>.

Superskyyy commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1444246023

   Whoever ends up taking this issue should provide benchmark results that compare throughput before and after the change and show the impact on service endpoint latency.



[GitHub] [skywalking] FAWC438 commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka

Posted by "FAWC438 (via GitHub)" <gi...@apache.org>.

FAWC438 commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1454578387

   > @FAWC438 Btw, this may be a good reason for us to actually attempt to migrate to a separate reporter process, instead of reporting inside the user application process, which can fork unexpectedly (asyncio will be quite painful to implement). It would also give us a further performance boost easily. This could be achieved during your potential GSoC task; are you interested?
   
   That's great! I'm interested.



[GitHub] [skywalking] Superskyyy commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka

Posted by "Superskyyy (via GitHub)" <gi...@apache.org>.

Superskyyy commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1445173563

   > I will try to handle this.
   
   Great! Once you have made progress, please don't hesitate to sync here, as this one is in rather high demand from users.



[GitHub] [skywalking] Superskyyy commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka

Posted by "Superskyyy (via GitHub)" <gi...@apache.org>.

Superskyyy commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1447007787

   > I have now basically finished migrating the code from **kafka-python** to **confluent-kafka-python** and successfully ran the prefork use case in the demo folder. No further benchmarking has been done yet.
   > 
   > The most obvious problem is that the **sw_confluent_kafka plugin is completely incompatible** after the change, and it must be manually disabled to run all programs correctly.
   > 
   > The reason for this is that I'm using the latest 2.0.2 version of confluent-kafka-python, and the plugin only supports up to version 1.8.2, which causes unknown errors in several of the main APIs. Given that confluent-kafka-python is based on librdkafka, these bugs seem to be difficult to fix.
   > 
   > If it is acceptable to use version 2.0.* of confluent-kafka-python, the sw_confluent_kafka plugin should be disabled or refactored.
   > 
   > Speaking of benchmarking, should I test against all supported versions of Python? Also, what exact presentation is needed for the test results? Is it OK to use charts? (I don't know how to use the testing tools here yet.)
   
   
   Great to see the progress! Plugin-core incompatibility is quite normal, and the general rule is that we don't force users onto specific versions of core reporter dependencies. (Users should be able to choose whatever confluent-kafka version they want; don't pin it.)
   
   As for the confluent-kafka plugin not working on 2.x, it's totally okay to leave it for now. (Skip the test by changing the support matrix to Python 3.7: [].) We can fix it later. Be careful, though: there is something you need to change inside the confluent-kafka plugin. Refer to the kafka-python plugin: you need to ignore SkyWalking's own topics, otherwise tracing becomes an endless recursion.
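
The topic filter mentioned above could look roughly like the sketch below. The topic names, `should_trace`, and `traced_produce` are assumptions for illustration; the real names come from the agent's configuration and the real plugin creates spans through the agent's tracing API rather than appending to a list.

```python
# Assumed topic names for illustration; the real ones come from the agent's configuration.
SKYWALKING_TOPICS = frozenset({
    "skywalking-segments",
    "skywalking-logs",
    "skywalking-meters",
    "skywalking-managements",
})


def should_trace(topic: str, own_topics=SKYWALKING_TOPICS) -> bool:
    """Return False for the agent's own reporting topics.

    Tracing a message sent to one of these topics would create a new segment,
    which is itself reported through Kafka, and so on without end.
    """
    return topic not in own_topics


def traced_produce(original_produce, topic: str, value: bytes, spans: list):
    """Hypothetical wrapper a plugin could install around a producer's produce()."""
    if should_trace(topic):
        spans.append(f"Kafka/{topic}/Producer")  # placeholder for creating an exit span
    return original_produce(topic, value)


spans: list = []
sent: list = []
traced_produce(lambda t, v: sent.append((t, v)), "orders", b"x", spans)
traced_produce(lambda t, v: sent.append((t, v)), "skywalking-segments", b"y", spans)
```

Both messages are still sent; only the application topic produces a span.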
   
   On the other hand, benchmarking on Python 3.10 will be fine. I'm just curious how much impact this change brings. You can use wrk (an open source tool) to test the throughput of the demo FastAPI application when Kafka is used. Another good tool for finding performance issues in a multithreaded environment is Yappi (Yet Another Python Profiler).



[GitHub] [skywalking] FAWC438 commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka

Posted by "FAWC438 (via GitHub)" <gi...@apache.org>.

FAWC438 commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1454577325

   > Great news! It's great to see the perf improvement, that's quite exciting.
   > 
   > Can you provide a minimal working example so I can have a look sometime? I think the prefork model shouldn't break if we don't start the agent before forking (the current sw-python --prefork flag actually attempts to start the agent in the master process for gunicorn and then listens for fork(), yet the librdkafka/producer connection may not survive such a fork, so it ends up as a segmentation fault).
   
   OK, I ran the [fastapi-gunicorn use case](https://github.com/apache/skywalking-python/blob/master/demo/gunicorn_consumer_prefork.py) with the following command:
   
   ```bash
   nohup sw-python -d run -p \
       gunicorn gunicorn_consumer_prefork:app \
       --workers 2 --worker-class uvicorn.workers.UvicornWorker \
       --log-level debug \
       --bind 0.0.0.0:8086 > <log file path>
   ```
   
   This is the log of the process, from program start until all workers were created successfully and a keyboard interrupt was then received:
   
   [log file](https://drive.google.com/file/d/1rpwBgHx0ASzNRbWQQxVUXbhEgRkXbCp9/view)
   
   The log shows that it took about 30 seconds, from the start at 2023-03-04 14:00:29 until all worker processes were successfully created at 2023-03-04 14:01:03, which matches gunicorn's [default timeout value](https://docs.gunicorn.org/en/stable/settings.html#timeout).
   
   Also, if you still want to try running it yourself (considering I haven't submitted a PR yet), you can use my modified code directly:
   
   ```bash
   git clone -b dev https://github.com/FAWC438/skywalking-python.git
   ```
   
   
   



[GitHub] [skywalking] FAWC438 commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka

Posted by "FAWC438 (via GitHub)" <gi...@apache.org>.

FAWC438 commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1453048524

   I have some good news and some bad news.
   
   The good news is that the performance improvement from confluent_kafka is substantial. Here are some results from my benchmark tests using [wrk](https://github.com/wg/wrk).
   
   For the [fastapi-gunicorn use case](https://github.com/apache/skywalking-python/blob/master/demo/gunicorn_consumer_prefork.py) in the demo folder, I ran several benchmark tests in the same environment; the results below are representative of the average. You can see that confluent-kafka-python delivers **more than twice the throughput** of kafka-python.
   
   _**kafka-python**_
   
   ```bash
   [root@localhost wrk]# ./wrk -t36 -c4000 -d30s --latency http://10.3.242.223:8087/cat
   Running 30s test @ http://10.3.242.223:8087/cat
     36 threads and 4000 connections
     Thread Stats   Avg      Stdev     Max   +/- Stdev
       Latency     1.41s   336.86ms   1.87s    72.58%
       Req/Sec    44.25     68.59   570.00     89.08%
     Latency Distribution
        50%    1.58s 
        75%    1.62s 
        90%    1.86s 
        99%    1.86s 
     16667 requests in 30.05s, 2.65MB read
     Socket errors: connect 0, read 5861, write 0, timeout 15241
   Requests/sec:    554.71
   Transfer/sec:     90.47KB
   [root@localhost wrk]# 
   ```
   
   _**confluent-kafka-python**_
   
   ```bash
   [root@localhost wrk]# ./wrk -t36 -c4000 -d30s --latency http://10.3.242.223:8086/cat
   Running 30s test @ http://10.3.242.223:8086/cat
     36 threads and 4000 connections
     Thread Stats   Avg      Stdev     Max   +/- Stdev
       Latency     1.20s   426.34ms   1.98s    60.69%
       Req/Sec    98.17    145.79     0.89k    87.40%
     Latency Distribution
        50%    1.16s 
        75%    1.56s 
        90%    1.79s 
        99%    1.92s 
     39687 requests in 30.05s, 6.32MB read
     Socket errors: connect 0, read 3444, write 0, timeout 19562
   Requests/sec:   1320.81
   Transfer/sec:    215.41KB
   [root@localhost wrk]# 
   ```
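
For reference, the speedup can be read straight from the two `Requests/sec` lines above. The helper name `requests_per_sec` is only illustrative; the figures are copied from the two runs.

```python
import re


def requests_per_sec(wrk_output: str) -> float:
    """Extract the Requests/sec figure from wrk's summary output."""
    match = re.search(r"Requests/sec:\s*([\d.]+)", wrk_output)
    if match is None:
        raise ValueError("no Requests/sec line found")
    return float(match.group(1))


# Summary lines copied from the two runs above.
KAFKA_PYTHON = "Requests/sec:    554.71"
CONFLUENT_KAFKA = "Requests/sec:   1320.81"

baseline = requests_per_sec(KAFKA_PYTHON)
candidate = requests_per_sec(CONFLUENT_KAFKA)
speedup = candidate / baseline
print(f"{baseline:.2f} -> {candidate:.2f} req/s, speedup {speedup:.2f}x")
```

This works out to roughly 2.4 times the request rate. Both runs also report a large number of socket timeouts, so the ratio is best treated as indicative rather than precise.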
   
   Although Kafka is **still a performance bottleneck**, this is already a significant improvement for the program.
   
   ```bash
   Clock type: CPU
   Ordered by: totaltime, desc
   
   name                                  ncall  tsub      ttot      tavg      
   ..py:321 SkyWalkingAgent.__heartbeat  2      0.000036  0.005567  0.002783
   ..afka.py:45 KafkaProtocol.heartbeat  2      0.000056  0.005531  0.002766
   ..ceManagementClient.send_heart_beat  2      0.000191  0.005465  0.002733
   ..ging/__init__.py:1424 Logger.debug  3      0.000058  0.005001  0.001667
   ..gging/__init__.py:1565 Logger._log  3      0.000089  0.004931  0.001644
   ..__init__.py:1550 Logger.makeRecord  3      0.000032  0.003934  0.001311
   ..__init__.py:282 LogRecord.__init__  3      0.000249  0.003902  0.001301
   ..nt/kafka.py:132 heartbeat_callback  1      0.000006  0.003774  0.003774
   ..hon3.9/abc.py:96 __instancecheck__  1      0.000017  0.003381  0.003381
   ..on3.9/abc.py:100 __subclasscheck__  134/1  0.000170  0.003345  0.000025
   ...
   
   Clock type: CPU
   Ordered by: totaltime, desc
   
   name                                  ncall  tsub      ttot      tavg      
   ..335 SkyWalkingAgent.__report_meter  2      0.000032  0.001472  0.000736
   ...py:141 KafkaProtocol.report_meter  1      0.000069  0.001415  0.001415
   ..KafkaMeterDataReportService.report  1      0.000111  0.000730  0.000730
   ..ging/__init__.py:1424 Logger.debug  1      0.000035  0.000615  0.000615
   ..gging/__init__.py:1565 Logger._log  1      0.000052  0.000575  0.000575
   ..nt/protocol/kafka.py:144 generator  10     0.000067  0.000519  0.000052
   ..b/python3.9/queue.py:154 Queue.get  10     0.000105  0.000344  0.000034
   ..sw_logging.py:40 Logger._sw_handle  1      0.000010  0.000286  0.000286
   ..ing/__init__.py:1591 Logger.handle  1      0.000012  0.000277  0.000277
   ...9/threading.py:280 Condition.wait  3      0.000070  0.000275  0.000092
   ...
   ```
   
   



[GitHub] [skywalking] Superskyyy commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka

Posted by "Superskyyy (via GitHub)" <gi...@apache.org>.

Superskyyy commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1454490190

   @FAWC438 Btw, this may be a good reason for us to actually attempt to migrate to a separate reporter process, instead of reporting inside the user application process, which can fork unexpectedly. It would also give us a further performance boost easily. This could be achieved during your potential GSoC task; are you interested?



[GitHub] [skywalking] wu-sheng closed issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka

Posted by "wu-sheng (via GitHub)" <gi...@apache.org>.

wu-sheng closed issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka
URL: https://github.com/apache/skywalking/issues/10447



[GitHub] [skywalking] FAWC438 commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka

Posted by "FAWC438 (via GitHub)" <gi...@apache.org>.

FAWC438 commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1446800146

I have now basically finished migrating the code from **kafka-python** to **confluent-kafka-python** and successfully ran the prefork use case in the demo folder. No further benchmarking has been done yet.

The most obvious problem is that the **sw_confluent_kafka plugin is completely incompatible** after the change, and it must be manually disabled to run all programs correctly.

The reason for this is that I'm using the latest 2.0.2 version of confluent-kafka-python, and the plugin only supports up to version 1.8.2, which causes unknown errors in several of the main APIs. Given that confluent-kafka-python is based on librdkafka, these bugs seem to be difficult to fix.

If it is acceptable to use version 2.0.* of confluent-kafka-python, the sw_confluent_kafka plugin should be disabled or refactored.

Speaking of benchmarking, should I test against all supported versions of Python? Also, what exact presentation is needed for the test results? Is it OK to use charts? (I don't know how to use the testing tools here yet.)


[GitHub] [skywalking] FAWC438 commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka

Posted by "FAWC438 (via GitHub)" <gi...@apache.org>.

FAWC438 commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1445012325

   I will try to handle this.



[GitHub] [skywalking] Superskyyy commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka

Posted by "Superskyyy (via GitHub)" <gi...@apache.org>.

Superskyyy commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1454486475

   Can you provide a minimal working example so I can have a look sometime? I think the prefork model shouldn't break if we don't start the agent before forking (the current sw-python --prefork flag actually attempts to start the agent for gunicorn, yet the librdkafka/producer connection may not survive such a fork).



[GitHub] [skywalking] FAWC438 commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka

Posted by "FAWC438 (via GitHub)" <gi...@apache.org>.

FAWC438 commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1453065348

   HOWEVER, there is also some serious bad news.
   
   First of all, uWSGI doesn't work at all. According to the traceback, this is because of a segmentation fault raised in the C module of librdkafka.
   
   ```bash
   !!! uWSGI process 161797 got Segmentation Fault !!!
   *** backtrace of 161797 ***
   uwsgi(uwsgi_backtrace+0x2e) [0x48ee1e]
   uwsgi(uwsgi_segfault+0x21) [0x48f1b1]
   /lib64/libc.so.6(+0x36400) [0x7f78e300e400]
   /lib64/libc.so.6(+0x13efd6) [0x7f78e3116fd6]
   /lib64/libcrypto.so.10(+0x1265b9) [0x7f78e3b315b9]
   /lib64/libcrypto.so.10(lh_insert+0x50) [0x7f78e3b318a0]
   /lib64/libcrypto.so.10(OBJ_NAME_add+0x6f) [0x7f78e3a7c5ff]
   ...
   uwsgi(python_call+0xf) [0x4a69cf]
   uwsgi(uwsgi_python_post_fork+0x86) [0x4a59c6]
   uwsgi(uwsgi_run+0x21e) [0x493fce]
   uwsgi() [0x43870e]
   /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f78e2ffa555]
   uwsgi() [0x4431d5]
   *** end of backtrace ***
   ```
   
   So I decided to disable confluent-kafka-python completely for uWSGI.
   
   Second, the Gunicorn use case is also problematic. Launching a gunicorn program was very slow: it took more than 30 seconds and consistently failed the GitHub CI test. I found the following in the debug log.
   
   ```bash
   2023-03-03 14:45:56,197 skywalking [pid:164712] [MainThread] [INFO] New process detected, re-initializing SkyWalking Python agent
   2023-03-03 14:45:56,198 skywalking [pid:164712] [MainThread] [DEBUG] heartbeat response: <cimpl.Message object at 0x7fab344e8240>
   2023-03-03 14:45:56,200 skywalking [pid:164712] [MainThread] [DEBUG] Confluent-Kafka local queue cleaned
   2023-03-03 14:45:56,201 skywalking [pid:164712] [MainThread] [DEBUG] Started meter service
   2023-03-03 14:45:56,202 skywalking [pid:164712] [MainThread] [DEBUG] Kafka reporter configs: {'bootstrap.servers': '127.0.0.1:9094'}
   ...
   [2023-03-03 14:45:58 +0800] [164661] [WARNING] Worker with pid 164712 was terminated due to signal 11
   ```
   
   So the problem is that every time gunicorn creates a worker process for the first time, it fails, either because of signal 11 (segmentation fault) or because the worker times out. Surprisingly, once gunicorn automatically restarts these processes, they run normally, which I think is why gunicorn starts so slowly.
   
   I've tried changing gunicorn's worker class (gevent, eventlet, or even sync) and changing gunicorn's timeout parameter (from 0 to 300), but to no avail; sometimes it can't even create a worker process at all.
   
   
   
   



[GitHub] [skywalking] Superskyyy commented on issue #10447: [Feature] Python agent performance enhancement with Confluent-Kafka

Posted by "Superskyyy (via GitHub)" <gi...@apache.org>.

Superskyyy commented on issue #10447:
URL: https://github.com/apache/skywalking/issues/10447#issuecomment-1482089228

   https://github.com/confluentinc/confluent-kafka-python/issues/351#issuecomment-379270066 << this confirms that the Kafka client cannot survive a fork.
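
One way to act on this, sketched under assumptions, is to create the client lazily in whichever process first uses it and never reuse an instance that belongs to another process id. `PerProcessClient` and the fake pid source below are hypothetical and only illustrate the idea; they are not the agent's code, and the behaviour of an inherited librdkafka handle is not exercised here.

```python
import os
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class PerProcessClient(Generic[T]):
    """Lazily build a client and rebuild it whenever the process id changes.

    A client created in a master process is never reused in a forked worker;
    each worker builds its own on first use.
    """

    def __init__(self, factory: Callable[[], T],
                 getpid: Callable[[], int] = os.getpid) -> None:
        self._factory = factory
        self._getpid = getpid
        self._client: Optional[T] = None
        self._owner_pid: Optional[int] = None

    def get(self) -> T:
        pid = self._getpid()
        if self._client is None or self._owner_pid != pid:
            # First use, or we are in a different process than the creator:
            # do not reuse the old client, build a fresh one for this process.
            self._client = self._factory()
            self._owner_pid = pid
        return self._client


# Demonstration with a fake pid source instead of a real fork().
current_pid = [100]
created = []


def make_client() -> str:
    created.append(current_pid[0])
    return f"client-for-{current_pid[0]}"


holder = PerProcessClient(make_client, getpid=lambda: current_pid[0])
first = holder.get()
same = holder.get()
current_pid[0] = 200  # simulate running in a forked worker
child = holder.get()
```

With the real library the factory would build a `confluent_kafka.Producer`. The safest use of this pattern is to make sure `get()` is first called only after the fork, so that no producer exists in the master process at all.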

