Posted to jira@kafka.apache.org by "Sam Cantero (Jira)" <ji...@apache.org> on 2021/09/21 19:46:00 UTC

[jira] [Commented] (KAFKA-12225) Unexpected broker bottleneck when scaling producers

    [ https://issues.apache.org/jira/browse/KAFKA-12225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17418292#comment-17418292 ] 

Sam Cantero commented on KAFKA-12225:
-------------------------------------

Is this similar to https://issues.apache.org/jira/browse/KAFKA-12838?

> Unexpected broker bottleneck when scaling producers
> ---------------------------------------------------
>
>                 Key: KAFKA-12225
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12225
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>         Environment: AWS Based
> 5-node cluster running on k8s with EBS attached disks (HDD)
> Kafka Version 2.5.0
> Multiple Producers (KafkaStreams, Akka Streams, golang Sarama)
>            Reporter: Harel Ben Attia
>            Priority: Major
>
>  
> *TL;DR*: There appears to be major lock contention on *{{Log.lock}}* during producer scale-out when produce-request sending is time-based ({{linger.ms}}) rather than size-based (max batch size).
> Hi,
> We're running a 5-node Kafka cluster on one of our production systems on AWS. Recently we have noticed that as our producer services scale out, the broker idle percentage drops abruptly from ~70% to 0% on all brokers, even though none of the brokers' physical resources are exhausted.
> Initially, we realised that our {{num.io.threads}} count was too low, causing high request queueing and the low idle percentage, so we increased it, expecting one of the physical resources to max out instead. After the change we still saw abrupt drops of the idle percentage to 0% (with no physical resource maxing out), so we kept investigating.
> The investigation showed a direct relation to {{linger.ms}} being the controlling factor for sending produce requests. Whenever messages are sent out by the producer because the {{linger.ms}} threshold is reached, scaling out the service increases the number of produce requests disproportionately to the increase in traffic, bringing all the brokers to a near-halt in terms of request processing and, as mentioned, without exhausting any physical resource.
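> For illustration, here is a minimal, hypothetical sketch of the send trigger as we understand it (the helper below is not real Kafka client code; the name and signature are made up purely to model the two conditions the producer checks):
> {code:java}
> // Hypothetical helper, loosely modelling when the producer drains a batch.
> // Not actual Kafka client code; written only to illustrate the two triggers.
> public class SendTriggerSketch {
>     static boolean readyToSend(int batchBytes, long waitedMs, int batchSizeConfig, long lingerMsConfig) {
>         boolean full = batchBytes >= batchSizeConfig;   // size-based trigger (batch.size)
>         boolean expired = waitedMs >= lingerMsConfig;   // time-based trigger (linger.ms)
>         // With low per-instance traffic and a small linger.ms, "expired" fires first on
>         // every instance, so the produce-request rate tracks producer count, not data volume.
>         return full || expired;
>     }
> }
> {code}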
> After some more experiments and profiling a broker through flight recorder, we found that the cause of the issue is lock contention on a *{{java.lang.Object}}*, wasting a lot of time across all the {{data-plane-kafka-request-handler}} threads. 90% of the lock waits were on Log's *{{lock: Object}}* instance, inside the *{{Log.append()}}* method. The stack traces show that these waits occur during the {{handleProduceRequest}} method. We ruled out replication as the source of the issue, since there were no replication problems and the control plane has a separate thread pool, which focused us back on the behaviour of our producer services when scaling out.
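> To illustrate the contention pattern, here is a simplified Java sketch (not the actual broker code, which is Scala): every request-handler thread appending to the same partition has to enter the same per-{{Log}} monitor, so many small produce requests from many producers end up serialised on it:
> {code:java}
> // Simplified illustration of the per-partition append lock; not the real Log implementation.
> public class PartitionLogSketch {
>     private final Object lock = new Object();  // analogous to Log's "lock: Object"
>
>     public void append(byte[] recordBatch) {
>         synchronized (lock) {
>             // Assign offsets, write to the active segment, update indexes, etc.
>             // Only one data-plane-kafka-request-handler thread can append to this
>             // partition at a time, so frequent tiny batches serialise here.
>         }
>     }
> }
> {code}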
> At that point we thought the issue might be related to the number of partitions of the topic (currently 60), and that increasing it would reduce the lock contention on each {{Log}} instance. However, since each producer writes to all partitions (data is evenly spread, not skewed), increasing the number of partitions would only cause each producer to generate more produce requests, without alleviating the lock contention. Likewise, increasing the number of brokers would raise the idle percentage per broker but would not help reduce produce-request latency, since it would not change the rate of produce requests per Log.
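> As a rough back-of-the-envelope illustration (assuming each producer instance has data for every partition in every linger interval): with P producer instances each flushing every {{linger.ms}} = 5 ms, each partition's {{Log}} sees on the order of P * 200 appends per second, regardless of how small the appended batches are or how many partitions the topic has. The per-{{Log}} contention is therefore governed by producer count and linger interval rather than by data volume.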
> Eventually, we worked around the issue by making {{linger.ms}} high enough that it stopped being the controlling factor for sending messages (i.e. produce requests became coupled to traffic volume because the max batch size became the controlling factor). This allowed us to utilise the cluster better without upscaling it.
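> For reference, the workaround amounts to a producer configuration along these lines (the values are illustrative, not our exact numbers): raise {{linger.ms}} well above the typical batch-fill time and size {{batch.size}} for the desired request size, so that full batches, rather than the timer, trigger the send:
> {code:java}
> import java.util.Properties;
> import org.apache.kafka.clients.producer.KafkaProducer;
> import org.apache.kafka.clients.producer.ProducerConfig;
> import org.apache.kafka.common.serialization.StringSerializer;
>
> public class WorkaroundProducerConfigSketch {
>     public static void main(String[] args) {
>         // Illustrative values only; the right numbers depend on message size and throughput.
>         Properties props = new Properties();
>         props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
>         props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
>         props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
>         props.put(ProducerConfig.BATCH_SIZE_CONFIG, 131072); // 128 KB: batches fill before the timer fires
>         props.put(ProducerConfig.LINGER_MS_CONFIG, 100);     // high enough that it rarely triggers the send
>         try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
>             // Produce as usual; requests are now driven by batch size, not by the linger timer.
>         }
>     }
> }
> {code}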
> From our analysis, this lock behaviour seems to limit Kafka's robustness to producer configuration and scaling, and it hurts the ability to do efficient capacity planning for the cluster, increasing the risk of an unexpected bottleneck when traffic grows.
> It would be great if you could validate these conclusions, or provide any further information that would help us understand the issue better or work around it more efficiently.


