You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by "dan norwood (JIRA)" <ji...@apache.org> on 2019/05/23 15:19:00 UTC

[jira] [Created] (KAFKA-8414) org.apache.kafka.common.metrics.MetricsTest.testConcurrentReadUpdateReport hang

dan norwood created KAFKA-8414:
----------------------------------

             Summary: org.apache.kafka.common.metrics.MetricsTest.testConcurrentReadUpdateReport hang 
                 Key: KAFKA-8414
                 URL: https://issues.apache.org/jira/browse/KAFKA-8414
             Project: Kafka
          Issue Type: Bug
            Reporter: dan norwood


caveat: this only happens on AMD Epyc machines with >=48 cpus. i have below a bunch of machine info from various `*a.*` aws instance sizes i ran against.

 

 

i noticed what seems like a deadlock when running `org.apache.kafka.common.metrics.MetricsTest` on an aws instance with 96vCPUs (specifically a m5a.24xlarge). after some debugging it seems like the offending issue is [https://github.com/apache/kafka/blob/trunk/clients/src/test/java/org/apache/kafka/common/metrics/MetricsTest.java#L776-L778]

 
{code:java}
public void run() {
  try {
    while (alive.get()) {
      op.run();
    }
  } catch (Throwable t) {
    log.error("Metric {} failed with exception", opName, t);
  }
}
{code}
since the `op.run()` methods are all synchronized we end up nonstop hammering it. after adding some logging i saw steadily increasing wait times for entry in to each synchronized block. so this is not *really* a deadlock or hang, but a progressive slowdown that makes the test unrunnable.

 

 

the offending op seems to be [https://github.com/apache/kafka/blob/trunk/clients/src/test/java/org/apache/kafka/common/metrics/MetricsTest.java#L747]

 
{code:java}
Future<?> reportFuture = executorService.submit(new ConcurrentMetricOperation(alive, "report", () -> reporter.processMetrics()));
{code}
 

 

possible fix:

adding a `Thread.sleep(0, 1)` inside the runloop for `ConcurrentMetricOperation` seems to allow the test to pass. but i'm not sure that it wouldn't mask an issue that the test is meant to detect

 

 

 

 

Good: 

t3a.large
```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  2
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               1
Model name:          AMD EPYC 7571
Stepping:            2
CPU MHz:             2200.116
BogoMIPS:            4400.23
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0,1
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext cpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save
```

t3a.2xlarge
```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               1
Model name:          AMD EPYC 7571
Stepping:            2
CPU MHz:             2199.916
BogoMIPS:            4399.83
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext cpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save
```

m5a.4xlarge
```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           1
NUMA node(s):        1
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               1
Model name:          AMD EPYC 7571
Stepping:            2
CPU MHz:             2585.550
BogoMIPS:            4399.98
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-15
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core cpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save
```


bad:
m5a.4xlarge
```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           1
NUMA node(s):        3
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               1
Model name:          AMD EPYC 7571
Stepping:            2
CPU MHz:             2315.397
BogoMIPS:            4400.13
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7,24-31
NUMA node1 CPU(s):   8-15,32-39
NUMA node2 CPU(s):   16-23,40-47
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core cpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save
```

m5a.24xlarge
```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           2
NUMA node(s):        6
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               1
Model name:          AMD EPYC 7571
Stepping:            2
CPU MHz:             2421.512
BogoMIPS:            4399.19
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7,48-55
NUMA node1 CPU(s):   8-15,56-63
NUMA node2 CPU(s):   16-23,64-71
NUMA node3 CPU(s):   24-31,72-79
NUMA node4 CPU(s):   32-39,80-87
NUMA node5 CPU(s):   40-47,88-95
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext perfctr_core cpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt nrip_save
```



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)