Posted to issues@trafficserver.apache.org by GitBox <gi...@apache.org> on 2021/02/22 20:38:14 UTC
[GitHub] [trafficserver] bryancall opened a new issue #7546: LogObject::_checkout_write using a lot of CPU on many core servers
bryancall opened a new issue #7546:
URL: https://github.com/apache/trafficserver/issues/7546
When running benchmarks on large AMD servers (128 cores), I am seeing a large amount of CPU time spent in `LogObject::_checkout_write`. Here are the results from the benchmark:
https://github.com/bryancall/benchmark/blob/main/reports/amd_epic_7742_rhel7_aocc_rpms_no-freelist_jemalloc_openss111/http2_benchmark.report
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [trafficserver] ywkaras commented on issue #7546: LogObject::_checkout_write using a lot of CPU on many core servers
ywkaras commented on issue #7546:
URL: https://github.com/apache/trafficserver/issues/7546#issuecomment-786257592
When I ran the benchmark on a 64-CPU Intel Xeon, I did not see this problem. It seems the problem is AMD-dependent, or perhaps configuration-dependent.
----------------------------------------------------------------
[GitHub] [trafficserver] ywkaras closed issue #7546: LogObject::_checkout_write using a lot of CPU on many core servers
ywkaras closed issue #7546:
URL: https://github.com/apache/trafficserver/issues/7546
----------------------------------------------------------------
[GitHub] [trafficserver] ywkaras removed a comment on issue #7546: LogObject::_checkout_write using a lot of CPU on many core servers
ywkaras removed a comment on issue #7546:
URL: https://github.com/apache/trafficserver/issues/7546#issuecomment-784490419
Any record of the configuration that caused this problem?
----------------------------------------------------------------
[GitHub] [trafficserver] bryancall edited a comment on issue #7546: LogObject::_checkout_write using a lot of CPU on many core servers
bryancall edited a comment on issue #7546:
URL: https://github.com/apache/trafficserver/issues/7546#issuecomment-786356094
Here is output from the package built with the gcc compiler, using the http/2 benchmark. It isn't as bad as AOCC, but it is still an expensive operation:
```
**http2load**
finished in 120.34s, 24928.55 req/s, 390.48MB/s
requests: 3000000 total, 3000000 started, 3000000 done, 3000000 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 3000000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 45.89GB (49274125891) total, 39.21MB (41119867) headers (space savings 96.56%), 45.78GB (49152000000) data
min max mean sd +/- sd
time for request: 297us 629.51ms 8.01ms 8.71ms 88.32%
time for connect: 33.66ms 91.14ms 53.87ms 14.13ms 55.50%
time to 1st byte: 35.44ms 123.39ms 69.10ms 16.27ms 63.50%
req/s : 121.07 183.79 140.32 22.49 72.00%
**dstat**
You did not select any stats, using -cdngy by default.
----total-cpu-usage---- -dsk/total- ---net/lo-- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send: recv send| in out | int csw
1 0 99 0 0 0|2614B 211k| 0 0 : 0 0 | 0 0 | 28k 84k
44 34 22 0 0 0|3123B 1209k| 338M 338M: 11M 640k| 0 0 |1045k 443k
46 34 20 0 0 0| 0 11M| 355M 355M: 11M 727k| 0 0 |1073k 440k
46 34 20 0 0 0| 0 54M| 347M 347M: 11M 589k| 0 0 |1069k 419k
45 35 19 0 0 0| 819B 688k| 386M 386M: 11M 1408k| 0 0 |1089k 509k
48 32 19 0 0 0| 0 5896k| 368M 368M: 11M 1487k| 0 0 |1084k 499k
43 37 20 0 0 1| 0 55M| 378M 378M: 11M 1542k| 0 0 |1096k 494k
45 36 19 1 0 1|2048B 57M| 384M 384M: 11M 1689k| 0 0 |1095k 512k
46 34 19 0 0 0| 0 4642k| 385M 385M: 11M 1547k| 0 0 |1089k 518k
33 38 28 0 0 1| 0 4176k| 465M 465M: 11M 1877k| 0 0 |1001k 652k
22 36 41 0 0 1|3686B 47M| 503M 503M: 11M 2015k| 0 0 | 858k 666k
22 34 41 2 0 1|3277B 47M| 439M 439M: 11M 1637k| 0 0 |1122k 561k
22 26 50 1 0 1|2867B 47M| 554M 554M: 11M 1814k| 0 0 | 806k 692k
**perf stat**
Performance counter stats for process id '17983':
10,466,445.26 msec task-clock # 86.920 CPUs utilized
28,088,848 context-switches # 0.003 M/sec
1,237,496 cpu-migrations # 0.118 K/sec
567,537 page-faults # 0.054 K/sec
33,958,594,177,446 cycles # 3.245 GHz (66.69%)
977,798,939,842 stalled-cycles-frontend # 2.88% frontend cycles idle (66.68%)
675,897,082,577 stalled-cycles-backend # 1.99% backend cycles idle (66.69%)
4,346,448,211,697 instructions # 0.13 insn per cycle
# 0.22 stalled cycles per insn (66.68%)
916,758,375,688 branches # 87.590 M/sec (66.68%)
27,001,398,959 branch-misses # 2.95% of all branches (66.70%)
120.414481542 seconds time elapsed
**perf report**
# Total Lost Samples: 0
#
# Samples: 48M of event 'cycles'
# Event count (approx.): 36487625633965
#
# Overhead Shared Object Symbol
# .......... .................... ..................................................
#
39.17% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
31.72% traffic_server [.] LogObject::_checkout_write
1.79% libpthread-2.17.so [.] pthread_mutex_trylock
1.67% libjemalloc.so.2 [.] je_malloc_mutex_lock_slow
0.54% [kernel.kallsyms] [k] copy_user_generic_string
0.46% [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter
0.43% libc-2.17.so [.] __memcpy_ssse3
0.41% [kernel.kallsyms] [k] try_to_wake_up
0.38% [kernel.kallsyms] [k] native_write_msr_safe
```
----------------------------------------------------------------
[GitHub] [trafficserver] bryancall commented on issue #7546: LogObject::_checkout_write using a lot of CPU on many core servers
bryancall commented on issue #7546:
URL: https://github.com/apache/trafficserver/issues/7546#issuecomment-786358062
Here is the annotation of the _checkout_write method:
![Screen Shot 2021-02-25 at 6 03 56 PM](https://user-images.githubusercontent.com/216749/109244605-d6a57e80-7793-11eb-817a-52e628407d60.png)
----------------------------------------------------------------
[GitHub] [trafficserver] ywkaras commented on issue #7546: LogObject::_checkout_write using a lot of CPU on many core servers
ywkaras commented on issue #7546:
URL: https://github.com/apache/trafficserver/issues/7546#issuecomment-786971941
I've not been able to reproduce the problem on the AMD proxy. These are my results:
```
-bash-4.2$ cat http2_benchmark.report
**http2load**
finished in 10.17s, 98320.46 req/s, 1.50GB/s
requests: 1000000 total, 1000000 started, 1000000 done, 1000000 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 1000000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 15.29GB (16419037931) total, 7.66MB (8031931) headers (space savings 96.63%), 15.26GB (16384000000) data
min max mean sd +/- sd
time for request: 111us 130.27ms 1.51ms 4.90ms 92.15%
time for connect: 3.38ms 87.15ms 38.33ms 20.31ms 69.00%
time to 1st byte: 5.59ms 106.70ms 47.40ms 19.96ms 74.00%
req/s : 476.66 803.50 671.46 97.62 65.50%
**dstat**
You did not select any stats, using -cdngy by default.
----total-cpu-usage---- -dsk/total- ---net/lo-- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send: recv send| in out | int csw
1 0 99 0 0 0|2597B 211k| 0 0 : 0 0 | 0 0 | 28k 84k
13 5 81 0 0 1| 0 19M|1590M 1590M:2169B 2064B| 0 0 | 340k 295k
**perf stat**
Performance counter stats for process id '52642':
167,926.60 msec task-clock # 16.268 CPUs utilized
644,529 context-switches # 0.004 M/sec
16,677 cpu-migrations # 0.099 K/sec
7,170 page-faults # 0.043 K/sec
292,679,346,215 cycles # 1.743 GHz (67.12%)
41,677,240,403 stalled-cycles-frontend # 14.24% frontend cycles idle (67.21%)
22,998,344,897 stalled-cycles-backend # 7.86% backend cycles idle (67.20%)
212,494,852,375 instructions # 0.73 insn per cycle
# 0.20 stalled cycles per insn (67.15%)
38,284,614,744 branches # 227.984 M/sec (67.13%)
1,354,399,053 branch-misses # 3.54% of all branches (67.09%)
10.322305247 seconds time elapsed
**perf report**
# Total Lost Samples: 0
#
# Samples: 1M of event 'cycles'
# Event count (approx.): 448142823289
#
# Overhead Shared Object Symbol
# .......... .................... ..................................................
#
7.17% libcrypto.so.1.1 [.] _aesni_ctr32_ghash_6x
5.81% libc-2.17.so [.] __memcpy_ssse3
3.51% [kernel.kallsyms] [k] copy_user_generic_string
2.65% libtscore.so.10.0.0 [.] freelist_new
1.94% [kernel.kallsyms] [k] acpi_processor_ffh_cstate_enter
1.60% libc-2.17.so [.] __memcmp_sse4_1
1.04% libtscpputil.so.10.0 [.] memcmp
0.91% traffic_server [.] HpackIndexingTable::lookup
0.80% [kernel.kallsyms] [k] _raw_spin_lock_irqsave
...
```
I'm seeing much lower req/s on the AMD than on the Intel Xeon.
Xeon:
```
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz
Stepping: 1
CPU MHz: 1200.091
CPU max MHz: 3000.0000
CPU min MHz: 1200.0000
BogoMIPS: 4199.63
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 40960K
```
AMD:
```
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7742 64-Core Processor
Stepping: 0
CPU MHz: 2250.000
CPU max MHz: 2250.0000
CPU min MHz: 1500.0000
BogoMIPS: 4491.55
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
```
----------------------------------------------------------------
[GitHub] [trafficserver] ywkaras commented on issue #7546: LogObject::_checkout_write using a lot of CPU on many core servers
ywkaras commented on issue #7546:
URL: https://github.com/apache/trafficserver/issues/7546#issuecomment-783835929
Outside chance that https://github.com/apache/trafficserver/pull/7382 would fix this.
----------------------------------------------------------------
[GitHub] [trafficserver] ywkaras commented on issue #7546: LogObject::_checkout_write using a lot of CPU on many core servers
ywkaras commented on issue #7546:
URL: https://github.com/apache/trafficserver/issues/7546#issuecomment-784490419
Any record of the configuration that caused this problem?
----------------------------------------------------------------
[GitHub] [trafficserver] ywkaras commented on issue #7546: LogObject::_checkout_write using a lot of CPU on many core servers
ywkaras commented on issue #7546:
URL: https://github.com/apache/trafficserver/issues/7546#issuecomment-790235752
I've been able to cause a bottleneck in LogObject::_checkout_write() on the AMD host, using open-source TS (commit 5c65bcb4d0739cbadd09fab3d836110d529d48d2). I had to set proxy.config.exec_thread.autoconfig.scale to 2.0. I ran TS (without manager) with this command:
```
TS_ROOT=$PWD LD_LIBRARY_PATH=$PWD/lib bin/traffic_server -f -F &
```
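For reference, the thread-scale setting above corresponds to this records.config line (a sketch; the float value multiplies the autoconfigured event-thread count, so 2.0 doubles it):
```
CONFIG proxy.config.exec_thread.autoconfig.scale FLOAT 2.0
```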
To see the bottleneck, I have to start with an empty log directory. I have to run 10 h2load processes at once with this script:
```
do_h2()
{
  /home/wkaras/h2load --h1 -t 64 -D 60 -c 512 http://localhost:80$1/fake > ~/80$1.log &
}

rm -f ~/80*.log
for P in $( seq 81 90 )
do
  do_h2 $P
done
```
"fake" is a 1024-byte object that I create with a PUSH. When I run "perf top", I see about 95% of CPU time being spent in _checkout_write. But the bottleneck gradually disappears; the % of CPU time drops (as squid.log fills up?).
For unknown reasons, I sometimes see the (disappearing) bottleneck in HttpTransact::OriginDead() instead of LogObject::_checkout_write().
----------------------------------------------------------------
[GitHub] [trafficserver] bryancall commented on issue #7546: LogObject::_checkout_write using a lot of CPU on many core servers
bryancall commented on issue #7546:
URL: https://github.com/apache/trafficserver/issues/7546#issuecomment-786356094
Here is output from the package built with the gcc compiler. It isn't as bad as AOCC, but it is still the most expensive operation by far:
```
**http2load**
finished in 32.31s, 92838.15 req/s, 1.44GB/s
requests: 3000000 total, 3000000 started, 3000000 done, 3000000 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 3000000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 46.49GB (49917000000) total, 583.65MB (612000000) headers (space savings 0.00%), 45.78GB (49152000000) data
min max mean sd +/- sd
time for request: 178us 864.37ms 1.88ms 6.56ms 95.16%
time for connect: 179us 60.27ms 30.83ms 24.62ms 32.00%
time to 1st byte: 7.67ms 252.08ms 93.93ms 50.69ms 68.00%
req/s : 469.10 651.57 573.43 54.94 64.50%
**dstat**
You did not select any stats, using -cdngy by default.
----total-cpu-usage---- -dsk/total- ---net/lo-- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send: recv send| in out | int csw
1 0 99 0 0 0|2531B 210k| 0 0 : 0 0 | 0 0 | 28k 84k
56 6 38 0 0 1| 0 726k|1590M 1590M: 11M 903k| 0 0 |1097k 307k
56 7 36 0 0 1| 43k 24M|1657M 1657M: 11M 779k| 0 0 | 960k 303k
41 6 53 0 0 1| 56k 1315k|1548M 1548M: 11M 1048k| 0 0 | 767k 270k
**perf stat**
Performance counter stats for process id '7940':
2,090,891.52 msec task-clock # 52.006 CPUs utilized
1,812,158 context-switches # 0.867 K/sec
173,492 cpu-migrations # 0.083 K/sec
2,323,668 page-faults # 0.001 M/sec
6,216,710,543,871 cycles # 2.973 GHz (66.78%)
673,283,513,494 stalled-cycles-frontend # 10.83% frontend cycles idle (66.79%)
305,534,719,417 stalled-cycles-backend # 4.91% backend cycles idle (66.81%)
2,089,728,448,270 instructions # 0.34 insn per cycle
# 0.32 stalled cycles per insn (66.80%)
452,605,506,651 branches # 216.465 M/sec (66.81%)
16,368,538,783 branch-misses # 3.62% of all branches (66.81%)
40.205123665 seconds time elapsed
**perf report**
# Total Lost Samples: 0
#
# Samples: 11M of event 'cycles'
# Event count (approx.): 7027789530465
#
# Overhead Shared Object Symbol
# .......... .................... ..................................................
#
12.97% traffic_server [.] LogObject::_checkout_write
4.54% [kernel.kallsyms] [k] native_queued_spin_lock_slowpath
3.92% traffic_server [.] TSHttpTxnReenable
3.13% libtscore.so.9.0.1 [.] malloc_new
3.08% libtscore.so.9.0.1 [.] malloc_free
2.87% traffic_server [.] HttpSM::state_api_callout
2.69% libtscore.so.9.0.1 [.] ink_freelist_new
2.63% traffic_server [.] HttpHookState::getNext
2.59% libtscore.so.9.0.1 [.] ink_freelist_free
```
----------------------------------------------------------------
[GitHub] [trafficserver] zwoop commented on issue #7546: LogObject::_checkout_write using a lot of CPU on many core servers
zwoop commented on issue #7546:
URL: https://github.com/apache/trafficserver/issues/7546#issuecomment-783663137
Curious if (significantly) increasing proxy.config.log.log_buffer_size would help here?
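A larger log buffer means fewer full-buffer swaps, and so fewer contended checkouts. A records.config sketch (the default is 9216 bytes; the larger value here is purely illustrative):
```
CONFIG proxy.config.log.log_buffer_size INT 1048576
```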
----------------------------------------------------------------
[GitHub] [trafficserver] ywkaras edited a comment on issue #7546: LogObject::_checkout_write using a lot of CPU on many core servers
ywkaras edited a comment on issue #7546:
URL: https://github.com/apache/trafficserver/issues/7546#issuecomment-785425829
One thing I'm seeing is that, when one thread is blocked waiting on this memory allocation: https://github.com/apache/trafficserver/blob/564663b9b6888c8334ec8077b5583da9b192c3c2/proxy/logging/LogObject.cc#L399 , any other thread trying to reserve (checkout) buffer space from the same LogObject will keep getting the return value LB_RETRY from LogBuffer::writer_checkout(), and will thus busy-wait. So we could indirectly be seeing the memory allocation blocking problem. If that is the case, it would likely help to yield the thread when LB_RETRY is returned.
I think I need to reproduce the problem with a baseline build, then see if this change improves things.
----------------------------------------------------------------
[GitHub] [trafficserver] ywkaras commented on issue #7546: LogObject::_checkout_write using a lot of CPU on many core servers
ywkaras commented on issue #7546:
URL: https://github.com/apache/trafficserver/issues/7546#issuecomment-790835517
I built this commit, https://github.com/ywkaras/trafficserver/commit/cd61f97667d9f90ca9ea8bf93e455be089ac086a , and re-ran the testing above. Total requests per second almost doubled with the yield enabled.
----------------------------------------------------------------