Posted to jira@kafka.apache.org by "João Oliveirinha (Jira)" <ji...@apache.org> on 2020/06/23 11:36:00 UTC

[jira] [Commented] (KAFKA-9693) Kafka latency spikes caused by log segment flush on roll

    [ https://issues.apache.org/jira/browse/KAFKA-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142857#comment-17142857 ] 

João Oliveirinha commented on KAFKA-9693:
-----------------------------------------

Has anyone been able to reproduce this and identify the root cause directly? I believe this happens independently of whether the log is being rolled or not.

I am having similar problems. The only difference is that this happens at the high percentiles (99.999 and sometimes 99.99) without any log roll or application flush happening.

I still see spikes that normally go up to ~200ms, independently of the throughput pushed by the producer. For these tests I am using a single producer and a single consumer in the same process. The attached plot shows the end-to-end latency (producer to consumer).

I am using default configurations, and I have been able to push those spikes out to the 99.999th percentile most of the time by tuning the kernel page-cache flush parameters so that dirty pages are written back every 1 second instead of the default 30s.

What is strange to me is that the ~200ms value is pretty much consistent and unrelated to throughput. I also tried to increase the max in-flight requests as a way to push these latencies even further into the higher percentiles, but it behaved unexpectedly: in the higher-throughput runs the average latency increased from 2ms to ~40ms.

 

My configs:

*Producer*:
linger=0
batch=16k
recordSize=2k
compression=none
max.in.flight.conn=5
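
For reference, a minimal sketch of how these settings map onto the Java producer client. The bootstrap servers, the serializers, and the expansion of "max.in.flight.conn" to max.in.flight.requests.per.connection are my assumptions:

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerSetupSketch {
    static KafkaProducer<byte[], byte[]> buildProducer() {
        Properties props = new Properties();
        // Placeholder bootstrap servers; not part of the configs listed above.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ProducerConfig.LINGER_MS_CONFIG, 0);              // linger=0
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384);         // batch=16k
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "none");  // compression=none
        // Assuming "max.in.flight.conn=5" refers to this setting:
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        // The 2k record size is a property of the payload being sent, not a producer config.
        return new KafkaProducer<>(props);
    }
}
{code}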

*Setup*:

6 i3.2xlarge with 1.9TB SSD on 3 AZs
1 c5.2xlarge for the injector/consumer

 

 

!image-2020-06-23-12-24-58-788.png|width=975,height=520!

 

 

> Kafka latency spikes caused by log segment flush on roll
> --------------------------------------------------------
>
>                 Key: KAFKA-9693
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9693
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>         Environment: OS: Amazon Linux 2
> Kafka version: 2.2.1
>            Reporter: Paolo Moriello
>            Assignee: Paolo Moriello
>            Priority: Major
>              Labels: Performance, latency, performance
>         Attachments: image-2020-03-10-13-17-34-618.png, image-2020-03-10-14-36-21-807.png, image-2020-03-10-15-00-23-020.png, image-2020-03-10-15-00-54-204.png, image-2020-06-23-12-24-46-548.png, image-2020-06-23-12-24-58-788.png, latency_plot2.png
>
>
> h1. Summary
> When a log segment fills up, Kafka rolls over onto a new active segment and forces a flush of the old segment to disk. When this happens, log segment _append_ duration increases, causing significant latency spikes on producer(s) and replica(s). This ticket aims to highlight the problem and propose a simple mitigation: add a new configuration to enable/disable the rolled-segment flush.
> h1. 1. Phenomenon
> Produce request response time (99th ~ 99.9th %ile) repeatedly spikes to ~50x-200x more than usual. For instance, the 99th %ile is normally lower than 5ms, but when this issue occurs it reaches 100ms to 200ms. The 99.9th and 99.99th %iles even jump to 500-700ms.
> Latency spikes happen at a constant frequency (depending on the input throughput), for short periods of time. All the producers experience a latency increase at the same time.
> !image-2020-03-10-13-17-34-618.png|width=942,height=314!
> {{Example of response time plot observed on a single producer.}}
> URPs (under-replicated partitions) also occasionally appear in correspondence with the latency spikes. This is harder to reproduce, but from time to time it is possible to see a few partitions going out of sync at the same time as a spike.
> h1. 2. Experiment
> h2. 2.1 Setup
> Kafka cluster hosted on AWS EC2 instances.
> h4. Cluster
>  * 15 Kafka brokers: (EC2 m5.4xlarge)
>  ** Disk: 1100GB EBS volumes (4750Mbps)
>  ** Network: 10 Gbps
>  ** CPU: 16 Intel Xeon Platinum 8000
>  ** Memory: 64GB
>  * 3 Zookeeper nodes: m5.large
>  * 6 producers on 6 EC2 instances in the same region
>  * 1 topic, 90 partitions - replication factor=3
> h4. Broker config
> Relevant configurations:
> {quote}num.io.threads=8
>  num.replica.fetchers=2
>  offsets.topic.replication.factor=3
>  num.network.threads=5
>  num.recovery.threads.per.data.dir=2
>  min.insync.replicas=2
>  num.partitions=1
> {quote}
> h4. Perf Test
>  * Throughput ~6000-8000 (~40-70Mb/s input + replication = ~120-210Mb/s per broker)
>  * record size = 20000
>  * Acks = 1, linger.ms = 1, compression.type = none
>  * Test duration: ~20/30min
> h2. 2.2 Analysis
> Our analysis showed a high +correlation between log segment flush count/rate and the latency spikes+. This indicates that the spikes in max latency are related to Kafka's behavior when rolling over to new segments.
> The other metrics did not show any relevant impact on any hardware component of the cluster, e.g. CPU, memory, network traffic, disk throughput...
>  
>  !latency_plot2.png|width=924,height=308!
>  {{Correlation between latency spikes and log segment flush count. p50, p95, p99, p999 and p9999 latencies (left axis, ns) and the flush #count (right axis, stepping blue line in plot).}}
> Kafka schedules log flushing (this includes flushing the file records containing log entries, the offset index, the timestamp index and the transaction index) during _roll_ operations. The log is rolled over onto a new empty active segment when one of the following holds (see the simplified sketch after this list):
>  * the log segment is full
>  * the max time has elapsed since the timestamp of the first message in the segment (or, in its absence, since the segment create time)
>  * the index is full
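> A simplified sketch of this roll decision (illustrative only; the actual logic lives in kafka.log.Log / kafka.log.LogSegment in Scala, and all names below are invented):
> {code:java}
> // Illustrative sketch of the conditions listed above; not the actual Kafka code.
> final class RollConditionSketch {
>     /** Returns true when the active segment should be rolled onto a new one. */
>     static boolean shouldRoll(long segmentSizeBytes, int incomingBatchBytes, long maxSegmentBytes,
>                               long nowMs, long firstTimestampMs, long maxSegmentMs,
>                               boolean indexFull) {
>         boolean segmentFull = segmentSizeBytes + incomingBatchBytes > maxSegmentBytes; // log.segment.bytes
>         boolean maxTimeElapsed = nowMs - firstTimestampMs > maxSegmentMs;              // log.roll.ms / log.roll.hours
>         return segmentFull || maxTimeElapsed || indexFull;                             // offset/time index full
>     }
>     // When shouldRoll() is true, Kafka allocates a new active segment and schedules a
>     // flush of the completed one; that flush is the operation this ticket focuses on.
> }
> {code}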
> In this case, the increase in latency happens on _append_ of a new message set to the active segment of the log. This is a synchronous operation which therefore blocks producer requests, causing the latency increase.
> To confirm this, I instrumented Kafka to measure the duration of the FileRecords.append(MemoryRecords) method, which is responsible for writing memory records to file. As a result, I observed the same spiky pattern as in the producer latency, with a one-to-one correspondence with the append duration.
>  !image-2020-03-10-14-36-21-807.png|width=780,height=415!
>  {{FileRecords.append(MemoryRecords) duration during test run.}}
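> As a rough sketch of the kind of instrumentation described (the wrapper and the way the duration is reported are my additions; only FileRecords.append(MemoryRecords) itself comes from Kafka):
> {code:java}
> import java.io.IOException;
> import java.util.concurrent.TimeUnit;
> import org.apache.kafka.common.record.FileRecords;
> import org.apache.kafka.common.record.MemoryRecords;
>
> final class TimedAppend {
>     /** Times a single FileRecords.append(MemoryRecords) call and reports its duration. */
>     static int appendTimed(FileRecords segmentFile, MemoryRecords records) throws IOException {
>         long startNs = System.nanoTime();
>         int appendedBytes = segmentFile.append(records);  // synchronous write to the segment file
>         long tookMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs);
>         System.out.println("FileRecords.append took " + tookMs + " ms");
>         return appendedBytes;
>     }
> }
> {code}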
> Therefore, every time a new log segment is rolled (log.segment.bytes is set to its default value of 1GB), Kafka forces a flush of the completed segment, which appears to slow down the subsequent append requests on the active segment.
> h2. 2.3 Solution
> I managed to completely mitigate the problem by disabling the flush that happens on log segment roll. Latency spikes and append duration flattened out.
>  !image-2020-03-10-15-00-23-020.png|width=906,height=302!
>  !image-2020-03-10-15-00-54-204.png|width=903,height=301!
> {{Producer response time before and after disabling log flush.}}
>   
>  Generally, it is possible to control Kafka's flush behavior by setting the log.flush.* configurations. This flush policy can force data to disk after a period of time or after a certain number of messages has been written.
>   
>  However, these configurations don't have any impact on the flush of "rolled segments", which is scheduled and executed regardless.
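>  For reference, the existing flush-related broker settings being referred to are (listed without values; their defaults effectively leave flushing to the OS):
> {quote}log.flush.interval.messages: flush a partition's log after this many messages have been appended
>  log.flush.interval.ms: maximum time a message is kept in memory before it is flushed to disk
>  log.flush.scheduler.interval.ms: how often the log flusher checks whether any log needs flushing
> {quote}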
>   
>  Therefore, the suggested solution is to add a new configuration to potentially control (enable/disable) this flush invocation.
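>  As an illustration only (the flag name and all surrounding code are hypothetical, not an actual patch), the change would amount to making the roll-time flush conditional:
> {code:java}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> // Hypothetical sketch of the proposal; the real roll/flush logic is in kafka.log.Log (Scala)
> // and the configuration behind flushAfterRollEnabled is invented for illustration.
> final class RollFlushSketch {
>     private final boolean flushAfterRollEnabled;  // would come from the proposed new config
>     private final ExecutorService flushScheduler = Executors.newSingleThreadExecutor();
>
>     RollFlushSketch(boolean flushAfterRollEnabled) {
>         this.flushAfterRollEnabled = flushAfterRollEnabled;
>     }
>
>     void onSegmentRolled(Runnable flushCompletedSegment) {
>         // Today the completed segment is always flushed after a roll; the proposal is to
>         // let users opt out and accept the durability trade-off described in the note below.
>         if (flushAfterRollEnabled) {
>             flushScheduler.submit(flushCompletedSegment);
>         }
>     }
> }
> {code}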
> Note: what are the implications of disabling the log segment flush? Forcing the flush of old segments provides stronger durability guarantees: in case of a system crash, we would potentially lose only the messages in the active segment. By disabling this operation, we would increase the risk of losing more data.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)