Posted to notifications@skywalking.apache.org by "Superskyyy (via GitHub)" <gi...@apache.org> on 2023/04/11 14:27:44 UTC

[GitHub] [skywalking] Superskyyy opened a new issue, #10672: [Bug] Python agent critical performance regression

Superskyyy opened a new issue, #10672:
URL: https://github.com/apache/skywalking/issues/10672

   Below is quoted from @FAWC438. The root cause has been found; we are still investigating what exact change introduced the regression. Once this issue is fixed, a new release will be published immediately.
   
   I seem to have found where the problem is. 
   
   [These lines of code](https://github.com/apache/skywalking-python/blob/128278a1d61b6a92e67ee9c67cc2836a91370bbc/skywalking/agent/__init__.py#L310-L318) in `agent/__init__.py` cause the bug.
   
   This code results in a wait timeout of **0** forever when fetching the queued segment and log elements, which means the thread keeps hogging the CPU without ever getting a chance to "rest". Meanwhile, the `backoff_decorator` uses `threading.Event().wait()`, which in Python 3 is implemented in C under the hood and is therefore not restricted by the GIL, so multi-core CPU usage can easily exceed 100%.
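   
   To illustrate the mechanism, here is a minimal sketch (the loop structure is an assumption for illustration, not the agent's actual `report_with_backoff` implementation): with a wait of 0, `Event.wait()` returns immediately and the reporter loop spins without ever yielding the CPU.
   
   ```python
   import threading
   
   # Hypothetical reporter loop driven by an Event, as a backoff decorator might do.
   # With wait_time == 0, finished.wait() returns immediately, so the loop
   # degenerates into busy waiting and pins a CPU core.
   def reporter_loop(finished: threading.Event, report_once, wait_time: float) -> None:
       while not finished.is_set():
           report_once()             # drain whatever is currently queued
           finished.wait(wait_time)  # 0 -> never rests; 0.01 -> ~10 ms pause per cycle
   ```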
   
   To work around this bug, just set the `init_wait` parameter in the decorator of the two functions to a small non-zero positive number (as shown below, I set it to 0.01).
   
   ```python
       @report_with_backoff(reporter_name='segment', init_wait=0.01)
       def __report_segment(self) -> None:
           if not self.__segment_queue.empty():
               self.__protocol.report_segment(self.__segment_queue)
   
       @report_with_backoff(reporter_name='log', init_wait=0.01)
       def __report_log(self) -> None:
           if not self.__log_queue.empty():
               self.__protocol.report_log(self.__log_queue)
   ```
   
   The following is a comparison of CPU usage _before_ and _after_ making this change:
   
   _before_
   
   ```bash
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                             
   184153 root      20   0 2400796 136480  18488 S 115.0  0.2   0:18.67 python 
   ```
   
   
   _after_
   
   ```bash
   
      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                             
   181723 root      20   0 2475044 137004  18392 S   3.0  0.2   0:07.43 python
   ```
   
   _Originally posted by @FAWC438 in https://github.com/apache/skywalking/discussions/10642#discussioncomment-5579278_




[GitHub] [skywalking] wu-sheng commented on issue #10672: [Bug] Python agent critical performance regression

Posted by "wu-sheng (via GitHub)" <gi...@apache.org>.
wu-sheng commented on issue #10672:
URL: https://github.com/apache/skywalking/issues/10672#issuecomment-1504414600

   Furthermore, considering the criticality, you could choose whether or not to wait 72 hours in the voting process.
   A CVE fix or a critical fix can skip this period for the benefit of end users if you feel it is necessary.




[GitHub] [skywalking] Superskyyy closed issue #10672: [Bug] Python agent critical performance regression

Posted by "Superskyyy (via GitHub)" <gi...@apache.org>.
Superskyyy closed issue #10672: [Bug] Python agent critical performance regression
URL: https://github.com/apache/skywalking/issues/10672




[GitHub] [skywalking] Superskyyy commented on issue #10672: [Bug] Python agent critical performance regression

Posted by "Superskyyy (via GitHub)" <gi...@apache.org>.
Superskyyy commented on issue #10672:
URL: https://github.com/apache/skywalking/issues/10672#issuecomment-1503638043

   Right, using `event.wait(0)` is very close to busy waiting to me, since it constantly checks whether the event is set, and it will cause high CPU usage. Though I also don't know what happened previously and why it didn't cause similar performance degradation; that will need a deeper look into the previous code.
   
   But meanwhile I think it's good to add a wait (0.01 or even a bit more); the default of 0 is definitely not friendly for production environments.




[GitHub] [skywalking] wu-sheng commented on issue #10672: [Bug] Python agent critical performance regression

Posted by "wu-sheng (via GitHub)" <gi...@apache.org>.
wu-sheng commented on issue #10672:
URL: https://github.com/apache/skywalking/issues/10672#issuecomment-1503651856

   About the default value, what is this used for? Is it about consuming from an in-memory queue for the gRPC report? If so, it doesn't have to be so small.
   The Java agent waits 20ms when the queue was empty on the last poll, and always reads again immediately when it had data last time.
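   
   A minimal Python sketch of that adaptive strategy (illustrative names only; not the Java agent's actual code): back off for ~20 ms only when the previous poll found the queue empty, and keep reading immediately while data keeps arriving.
   
   ```python
   import queue
   import threading
   
   # Sketch of an adaptive consumer loop: sleep only after an empty poll,
   # report immediately while the queue still has data.
   def drain_loop(q: queue.Queue, report, finished: threading.Event,
                  empty_wait: float = 0.02) -> None:
       while not finished.is_set():
           try:
               item = q.get_nowait()      # non-blocking read
           except queue.Empty:
               finished.wait(empty_wait)  # queue was empty this time: rest ~20 ms
           else:
               report(item)               # had data: read again immediately
   ```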




[GitHub] [skywalking] Superskyyy commented on issue #10672: [Bug] Python agent critical performance regression

Posted by "Superskyyy (via GitHub)" <gi...@apache.org>.
Superskyyy commented on issue #10672:
URL: https://github.com/apache/skywalking/issues/10672#issuecomment-1520754591

   A fix has been merged; please feel free to try it out or wait for an official release within three days. @Forstwith 




[GitHub] [skywalking] Superskyyy commented on issue #10672: [Bug] Python agent critical performance regression

Posted by "Superskyyy (via GitHub)" <gi...@apache.org>.
Superskyyy commented on issue #10672:
URL: https://github.com/apache/skywalking/issues/10672#issuecomment-1503715453

   > About the default value, what is this used for? Is it about consuming from an in-memory queue for the gRPC report? If so, it doesn't have to be so small. The Java agent waits 20ms when the queue was empty on the last poll, and always reads again immediately when it had data last time.
   
   That is correct; it's the interval between each batch of segments/logs for the reporter. Thanks for the advice, we should consider following the Java agent's implementation and test whether such a mechanism works well in the context of Python. Our previous benchmark of the performance impact on services could change after this interval is tuned.




[GitHub] [skywalking] Superskyyy commented on issue #10672: [Bug] Python agent critical performance regression

Posted by "Superskyyy (via GitHub)" <gi...@apache.org>.
Superskyyy commented on issue #10672:
URL: https://github.com/apache/skywalking/issues/10672#issuecomment-1504411113

   > @Superskyyy BTW, if this is critical, you could choose to cherry-pick this fix to the 1.0.x branch and publish a patch release only. I can see several other issues planned for 1.1.0 are still open.
   
   Thanks for the suggestion, I will consider releasing a v1.0.1 within the week.




[GitHub] [skywalking] FAWC438 commented on issue #10672: [Bug] Python agent critical performance regression

Posted by "FAWC438 (via GitHub)" <gi...@apache.org>.
FAWC438 commented on issue #10672:
URL: https://github.com/apache/skywalking/issues/10672#issuecomment-1512872536

   In my tests, the solution proposed by @Forstwith can indeed solve the problem of high CPU usage. But I still think setting the wait parameter is a better solution.
   
   First, let's look at the impact of different solutions on application performance. I benchmarked the python agent with different solutions using wrk in the same environment.
   
   1. _remove empty check_
   
   ```bash
   [root@localhost wrk]# ./wrk -t36 -c4000 -d60s --latency http://10.3.242.223:8086/cat
   Running 1m test @ http://10.3.242.223:8086/cat
     36 threads and 4000 connections
     Thread Stats   Avg      Stdev     Max   +/- Stdev
       Latency     1.38s   238.84ms   1.65s    79.05%
       Req/Sec   146.40    272.77     1.08k    85.18%
     Latency Distribution
        50%    1.36s
        75%    1.55s
        90%    1.63s
        99%    1.65s
     36909 requests in 1.00m, 5.88MB read
     Socket errors: connect 0, read 805, write 0, timeout 30433
   Requests/sec:    614.21
   Transfer/sec:    100.17KB
   ```
   
   2. _wait 0.01s_
   
   ```bash
   [root@localhost wrk]# ./wrk -t36 -c4000 -d60s --latency http://10.3.242.223:8086/cat
   Running 1m test @ http://10.3.242.223:8086/cat
     36 threads and 4000 connections
     Thread Stats   Avg      Stdev     Max   +/- Stdev
       Latency     1.13s   295.78ms   1.46s    64.94%
       Req/Sec   141.00    244.04     1.08k    83.20%
     Latency Distribution
        50%    1.08s
        75%    1.43s
        90%    1.44s
        99%    1.45s
     39912 requests in 1.00m, 6.36MB read
     Socket errors: connect 0, read 664, write 0, timeout 36792
   Requests/sec:    664.13
   Transfer/sec:    108.31KB
   ```
   
   3. _wait 0.02s_
   
   ```bash
   [root@localhost wrk]# ./wrk -t36 -c4000 -d60s --latency http://10.3.242.223:8086/cat
   Running 1m test @ http://10.3.242.223:8086/cat
     36 threads and 4000 connections
     Thread Stats   Avg      Stdev     Max   +/- Stdev
       Latency     1.26s   409.85ms   1.96s    64.68%
       Req/Sec   128.19    235.44     1.10k    87.68%
     Latency Distribution
        50%    1.30s
        75%    1.60s
        90%    1.79s
        99%    1.94s
     38701 requests in 1.00m, 6.16MB read
     Socket errors: connect 0, read 1295, write 0, timeout 29983
   Requests/sec:    643.97
   Transfer/sec:    105.02KB
   ```
   
   4. _wait 0.1s_
   
   ```bash
   [root@localhost wrk]# ./wrk -t36 -c4000 -d60s --latency http://10.3.242.223:8086/cat
   Running 1m test @ http://10.3.242.223:8086/cat
     36 threads and 4000 connections
     Thread Stats   Avg      Stdev     Max   +/- Stdev
       Latency     1.21s   431.33ms   1.99s    62.69%
       Req/Sec    98.06    175.53     1.10k    88.71%
     Latency Distribution
        50%    1.20s 
        75%    1.64s 
        90%    1.81s 
        99%    1.99s 
     41067 requests in 1.00m, 6.54MB read
     Socket errors: connect 0, read 1878, write 0, timeout 31134
   Requests/sec:    683.39
   Transfer/sec:    111.45KB
   ```
   
   As can be seen, the performance is almost identical: the no-load CPU utilization of each solution does not exceed 3%, and the full-load CPU utilization is about 110% (on a 12-core CPU). This shows that this bug is not the performance bottleneck of the Python agent, and that none of the proposed changes have much impact on its performance.
   
   Next, I want to explain why, in my opinion, the empty check should not be removed. Take [this function](https://github.com/apache/skywalking-python/blob/077400367e4a4f8e82ce56e472c6cb34a0a81abd/skywalking/agent/protocol/grpc.py#L162-L192) in the source code, which fetches log data from the queue, as an example.
   
   If the queue empty check is removed, the task of preventing this thread from hogging the CPU in a loop is left to the Python `queue`'s blocking `get()` method. When the `block` parameter of `get()` is True, it causes the thread to release the CPU. This will result in:
   
   - **Repeated acquisition of thread locks**: Essentially, both [`queue.get()`](https://github.com/python/cpython/blob/1dad3d10714fe9611769196fde4e94bb63eb7f17/Lib/queue.py#L180) and [`threading.Event.wait()`](https://github.com/python/cpython/blob/1dad3d10714fe9611769196fde4e94bb63eb7f17/Lib/threading.py#L622) block the thread through `threading.Condition()`, which needs to acquire the thread lock even if the blocking timeout is 0. Handing the blocking over to the queue instead of the Event object means repeatedly acquiring thread locks, which may add unnecessary overhead (see the sketch after this list).
   - **May cause hidden dangers for future code**: In the current implementation of the three communication protocols (grpc/kafka/http), whether the queue blocks is an optional parameter, and the queue's blocking timeout is even user-configurable, which means that changes to this code may reintroduce the same bug. It could also be a risk if the Python agent plans to support additional communication protocols in the future.
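   
   As a small standalone illustration of the first point (a sketch, not the agent's code): even with a zero timeout, a blocking `get()` goes through the queue's internal `Condition` before raising `Empty`, whereas the current pattern only touches the queue after a cheap `empty()` check and rests via the `Event`.
   
   ```python
   import queue
   import threading
   
   q: queue.Queue = queue.Queue()
   finished = threading.Event()
   
   # A blocking get() enters the queue's internal Condition (acquiring its lock)
   # even when the timeout is 0, and then raises Empty.
   try:
       q.get(block=True, timeout=0)
   except queue.Empty:
       pass
   
   # The existing pattern: touch the queue only after a cheap empty() check and
   # rest via the Event (single consumer assumed, so the check cannot race).
   if not q.empty():
       print(q.get_nowait())
   finished.wait(0.02)
   ```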
   
   As for how to determine the value of the wait parameter, I have no clear idea, because this part does not seem to be the current performance bottleneck of the Python agent, and it is not easy to draw conclusions from my benchmark tests. Is it OK to simply set it to 0.02 (20 ms), like the Java agent?
   
   If the community thinks this solution works, please allow me to submit a related PR :)
   
   
   




[GitHub] [skywalking] FAWC438 commented on issue #10672: [Bug] Python agent critical performance regression

Posted by "FAWC438 (via GitHub)" <gi...@apache.org>.
FAWC438 commented on issue #10672:
URL: https://github.com/apache/skywalking/issues/10672#issuecomment-1503593982

   This is a strange problem, because in the 0.8.0 Python agent the default value of this wait parameter is also 0, yet it does not produce any excessive CPU usage.
   
   I suspected that the new decorator syntactic sugar in 1.0.0 caused it, but after I converted the decorator back into plain Python syntax, the bug still exists.
   
   I compared the source code of version 1.0.0 and version 0.8.0, and I think the `threading.Event()` object **_finished** that manages the thread might be the cause of the problem (though I'm not at all sure). First, it directly calls the `wait()` method that causes the bug; second, I noticed that it changed from a global variable in 0.8.0 to a class member in 1.0.0, and this object is obviously shared between threads.
   
   Here is the relevant [Python official documentation](https://docs.python.org/3/library/threading.html#event-objects) about `threading.Event()`.
   
   Of course, given the urgency of this bug, the simple approach of changing the parameter value I mentioned earlier may solve the immediate problem. However, there may be a slight impact on the performance of the internal queue.




[GitHub] [skywalking] wu-sheng commented on issue #10672: [Bug] Python agent critical performance regression

Posted by "wu-sheng (via GitHub)" <gi...@apache.org>.
wu-sheng commented on issue #10672:
URL: https://github.com/apache/skywalking/issues/10672#issuecomment-1504375182

   @Superskyyy BTW, if this is critical, you could choose to cherry-pick this fix to the 1.0.x branch and publish a patch release only.
   I can see several other issues planned for 1.1.0 are still open.




[GitHub] [skywalking] Forstwith commented on issue #10672: [Bug] Python agent critical performance regression

Posted by "Forstwith (via GitHub)" <gi...@apache.org>.
Forstwith commented on issue #10672:
URL: https://github.com/apache/skywalking/issues/10672#issuecomment-1508006571

   > This is a strange problem, because in the 0.8.0 Python agent the default value of this wait parameter is also 0, yet it does not produce any excessive CPU usage.
   > 
   > I suspected that the new decorator syntactic sugar in 1.0.0 caused it, but after I converted the decorator back into plain Python syntax, the bug still exists.
   > 
   > I compared the source code of version 1.0.0 and version 0.8.0, and I think the `threading.Event()` object **_finished** that manages the thread might be the cause of the problem (though I'm not at all sure). First, it directly calls the `wait()` method that causes the bug; second, I noticed that it changed from a global variable in 0.8.0 to a class member in 1.0.0, and this object is obviously shared between threads.
   > 
   > Here is the relevant [Python official documentation](https://docs.python.org/3/library/threading.html#event-objects) about `threading.Event()`.
   > 
   > Of course, given the urgency of this bug, the simple approach of changing the parameter value I mentioned earlier may solve the immediate problem. However, there may be a slight impact on the performance of the internal queue.
   
   
   
   > Right, using `event.wait(0)` is very close to busy waiting to me, since it constantly checks whether the event is set, and it will cause high CPU usage. Though I also don't know what happened previously and why it didn't cause similar performance degradation (but it obviously should have; maybe something amplified it in 1.0.0 through the singleton refactoring that I did); that will need a deeper look into the previous code (I will take a look later this week).
   > 
   > But meanwhile I think it's good to add a wait (0.01 or even a bit more, based on empirical evidence); the default of 0 is definitely not friendly for production environments.
   
   The difference between 0.8.0 and 1.0.0 is that 1.0.0 adds a non-empty check before the blocking queue, so the queue never gets a chance to block.
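   
   A simplified illustration of that difference (assumed structure for illustration, not the agent's actual code); in both cases the surrounding reporter loop calls `Event.wait(0)` between iterations:
   
   ```python
   import queue
   
   # 0.8.0-style: the blocking get() parks the thread until data arrives or the
   # timeout expires, so the outer loop's wait(0) rarely gets a chance to spin.
   def report_old_style(q: queue.Queue, report) -> None:
       try:
           report(q.get(block=True, timeout=1.0))
       except queue.Empty:
           pass
   
   # 1.0.0-style: the empty check returns immediately when there is nothing to
   # send, so the outer loop falls straight through to wait(0) and busy-waits.
   def report_new_style(q: queue.Queue, report) -> None:
       if not q.empty():
           report(q.get())
   ```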




[GitHub] [skywalking] Forstwith commented on issue #10672: [Bug] Python agent critical performance regression

Posted by "Forstwith (via GitHub)" <gi...@apache.org>.
Forstwith commented on issue #10672:
URL: https://github.com/apache/skywalking/issues/10672#issuecomment-1505096128

   I just got this problem too.
   ![image](https://user-images.githubusercontent.com/22184723/231440858-758208b0-57a6-4c58-9078-2ee0c32350a7.png)
   
   Just removing this empty check solves the problem; because it is a blocking queue, there is no need to check whether it is empty:
   ```python
       @report_with_backoff(reporter_name='segment', init_wait=0)
       def __report_segment(self) -> None:
           if not self.__segment_queue.empty():  # removing this empty check solves the problem; since it is a blocking queue, there is no need to check for emptiness
               self.__protocol.report_segment(self.__segment_queue)
   
       @report_with_backoff(reporter_name='log', init_wait=0)
       def __report_log(self) -> None:
           if not self.__log_queue.empty():
               self.__protocol.report_log(self.__log_queue)
   ```




[GitHub] [skywalking] Superskyyy commented on issue #10672: [Bug] Python agent critical performance regression

Posted by "Superskyyy (via GitHub)" <gi...@apache.org>.
Superskyyy commented on issue #10672:
URL: https://github.com/apache/skywalking/issues/10672#issuecomment-1513200405

   > In my tests, the solution proposed by @Forstwith can indeed solve the problem of high CPU usage. But I still think setting the wait parameter is a better solution.
   >
   > [...]
   >
   > If the community thinks this solution works, please allow me to submit a related PR :)
   
   Go ahead. Seems great!

