Posted to users@kafka.apache.org by Xinyi Su <xi...@gmail.com> on 2015/02/03 10:37:56 UTC

Kafka long tail latency issue

Hi,
I am building a Kafka cluster and running the producer perf test to measure
Kafka latency.
In the test results, I notice that the long-tail latency is very high and
increases over time, although the 99.9% result looks very good.
The worst latency can exceed 1 second. Meanwhile, disk utilization
is always very low, never more than 1%. I also tried tuning
log.flush.interval.ms from 1000ms to 200ms, but it did not help much.

Below is the max latency chart. The Y axis represents the max latency in
milliseconds and the X axis represents the elapsed time in milliseconds. The
chart shows the latency increasing gradually from about 10ms to 1095ms.

[image: Inline image]

The Kafka cluster consists of 4 hosts. The version is 2.9.2-0.8.2-beta.
The PerfTopic15 topic was created with 3 partitions and a replication factor
of 3.

Here is my perf script usage:
-bash-4.1$ bin/kafka-producer-perf-test.sh --broker-list <broker list> \
    --topics PerfTopic15 --sync --initial-message-id 1 --messages 200000 \
    --csv-reporter-enabled --metrics-dir /tmp/PerfTopic15_1 \
    --message-send-gap-ms 20 --request-num-acks -1 --batch-size 1

-bash-4.1$ bin/kafka-topics.sh --zookeeper <zkHost>:2181 --describe \
    --topic PerfTopic15
Topic:PerfTopic15 PartitionCount:3 ReplicationFactor:3 Configs:
Topic: PerfTopic15 Partition: 0 Leader: 3 Replicas: 3,4,1 Isr: 3,4,1
Topic: PerfTopic15 Partition: 1 Leader: 4 Replicas: 4,1,2 Isr: 4,1,2
Topic: PerfTopic15 Partition: 2 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3

I expected the worst latency not to exceed 100 milliseconds, so the test
result is very discouraging. Do you have any pointers on Kafka's long-tail
latency issue?

Hope for your reply! Thanks in advance!

Re: Kafka long tail latency issue

Posted by Guozhang Wang <wa...@gmail.com>.
Hi Xinyi,

With ack = -1 and three replicas in the ISR, the latency is most of the time
bounded by the time the follower replicas spend fetching from the leader,
since the produce request cannot be acknowledged until all replicas in the
ISR have fetched the data.

You can try to reduce "replica.fetch.wait.max.ms" and increase
"num.replica.fetchers" in the broker configs:

http://kafka.apache.org/documentation.html#brokerconfigs

But note that this will increase the CPU / network usage.
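
As a rough illustration of the two settings named above, a server.properties
fragment might look like the following. The values are illustrative
assumptions, not recommendations; the defaults listed in the 0.8.x broker
configuration are 500 for replica.fetch.wait.max.ms and 1 for
num.replica.fetchers.

```properties
# Illustrative values only (assumptions); tune against your own measurements.

# Max time a follower fetch request waits on the leader for new data.
# Lower values let followers pick up new messages sooner, at the cost of
# more frequent fetch requests.
replica.fetch.wait.max.ms=100

# Number of fetcher threads used to replicate from each source broker.
num.replica.fetchers=4
```

Each broker would need to be restarted for the change to take effect.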

Guozhang

Re: Kafka long tail latency issue

Posted by Jay Kreps <ja...@gmail.com>.
If you are on 0.8.1 or higher and are running with replication, consider
disabling the forced log flush. It will definitely lead to latency spikes,
as the flush is synchronous. You will still get durability from replication
and the background OS flush. On Linux, the background I/O flush done by the
OS doesn't have much impact.
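
A minimal sketch of what that could look like in server.properties, assuming
that leaving the forced-flush settings unset on your version means the broker
does not fsync on a fixed schedule (worth confirming against the broker
config documentation for your release). The commented-out values are
placeholders, not settings taken from this thread.

```properties
# Forced flush settings left commented out, so flushing is left to the OS
# and durability relies on replication (assumption: the defaults on this
# version do not force a periodic flush).
#log.flush.interval.messages=10000
#log.flush.interval.ms=1000
```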

Also, we fixed several significant latency-related bugs from 0.8.1 in the
0.8.2 release, so consider giving that a try.

Finally, Linux write performance is itself highly variable. Even in the
absence of any synchronous flushing, there is some locking around I/O
operations such as allocating new journal blocks. If you are running Linux, I
think we include some tuning options in the ops section of the
documentation that help reduce that. There is a test class,
kafka.TestLinearWriteSpeed, which will benchmark the throughput and latency
using either a plain file or a local Kafka log. It is worth doing this to
get a baseline for how fast and variable things can be in the absence of
any network or coordination.
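
For a very rough feel of that kind of baseline without the Kafka test class,
the Python sketch below appends fixed-size records to a temporary file and
records the latency of each write, optionally forcing an fsync every N
writes. It is not kafka.TestLinearWriteSpeed and does not go through a Kafka
log; the function names, record size and write counts are all assumptions
made for illustration.

```python
import math
import os
import tempfile
import time


def measure_append_latency(num_writes=2000, size=1024, fsync_every=0):
    """Append fixed-size records to a temp file; return per-write latency in ms.

    fsync_every=0 leaves flushing to the OS; N > 0 forces an fsync after
    every N-th write, which is where latency spikes would be expected.
    """
    payload = b"x" * size
    latencies = []
    fd, path = tempfile.mkstemp()
    try:
        for i in range(1, num_writes + 1):
            start = time.perf_counter()
            os.write(fd, payload)
            if fsync_every and i % fsync_every == 0:
                os.fsync(fd)
            latencies.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    for label, every in (("OS flush only", 0), ("fsync every 100 writes", 100)):
        lat = measure_append_latency(fsync_every=every)
        print("%s: p99.9=%.3f ms, max=%.3f ms"
              % (label, percentile(lat, 99.9), max(lat)))
```

Comparing the 99.9th percentile with the maximum for the two runs gives an
idea of how much tail latency the local disk and filesystem contribute on
their own, before any network or replication is involved.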

-Jay
