Posted to jira@kafka.apache.org by "Ismael Juma (Jira)" <ji...@apache.org> on 2022/12/23 15:48:00 UTC

[jira] [Commented] (KAFKA-14541) Profile produce workload for Apache Kafka

    [ https://issues.apache.org/jira/browse/KAFKA-14541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17651681#comment-17651681 ] 

Ismael Juma commented on KAFKA-14541:
-------------------------------------

Can you share a few more details about your producer setup? Was it a single producer? Also, is there a reason why `linger.ms=0` and `batch.size=9000` were used for this benchmark? For a benchmark targeting reasonable throughput, you'd usually want those values to be higher.
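For reference, a throughput-oriented configuration usually raises both of those values. A minimal sketch (the class name and the specific values below are only illustrative, not tuned recommendations):
{code:java}
// Illustrative sketch: a producer configured for throughput rather than latency.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class ThroughputTunedProducer {
    public static KafkaProducer<byte[], byte[]> create(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        // Larger batches plus a non-zero linger let the producer amortise
        // per-request overhead across many records.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 128 * 1024);           // vs. batch.size=9000
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);                    // vs. linger.ms=0
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 64L * 1024 * 1024);
        return new KafkaProducer<>(props);
    }
}{code}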

> Profile produce workload for Apache Kafka
> -----------------------------------------
>
>                 Key: KAFKA-14541
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14541
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Divij Vaidya
>            Assignee: Divij Vaidya
>            Priority: Major
>         Attachments: flamegraph-openjdk11nodebug-cpu-withoutobjectserializer.html
>
>
> I have been profiling Kafka (3.4.0 / trunk right now) for a produce-only workload using the [OpenMessaging|https://openmessaging.cloud/docs/benchmarks/] benchmark workloads. The goal is to get a better understanding of Kafka's CPU usage profile and eliminate potential overheads to reduce CPU consumption.
> h2. *Setup*
> R6i.16xl (64 cores)
> OS: Amazon Linux 2
> Single broker, One topic, One partition
> Plaintext
> Prometheus Java agent attached
>  
> {code:java}
> -XX:+PreserveFramePointer
> -XX:+UnlockDiagnosticVMOptions
> -XX:+DebugNonSafepoints{code}
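> The Prometheus Java agent mentioned in the setup was attached in addition to the flags above; assuming the Prometheus JMX exporter agent, the attachment would look something like the line below (jar path, port, and config file are placeholders):
> {code:java}
> -javaagent:/path/to/jmx_prometheus_javaagent.jar=7071:/path/to/kafka-jmx-exporter.yaml{code}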
>  
>  
> {code:java}
> queued.max.requests=10000
> num.network.threads=32
> num.io.threads=128
> socket.request.max.bytes=104857600{code}
>  
> h3. Producer setup:
> {code:java}
> batch.size=9000
> buffer.memory=33554432
> enable.idempotence=false
> linger.ms=0
> receive.buffer.bytes=-1
> send.buffer.bytes=-1
> max.in.flight.requests.per.connection=100000{code}
>  
> h3. Profiler setup:
> [async-profiler|https://github.com/jvm-profiling-tools/async-profiler] (this profiler, together with -XX:+DebugNonSafepoints, ensures that profiling doesn't suffer from safepoint bias). Note that the flamegraph can be generated in "cpu" mode, "wall" mode (wall clock time), or "cycles" mode (used for a better kernel call stack).
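> For reproducibility, async-profiler can also be loaded as a JVM agent alongside the options listed earlier; the line below is an illustrative example (the library path and output file name are placeholders):
> {code:java}
> -agentpath:/path/to/libasyncProfiler.so=start,event=cpu,file=flamegraph-cpu.html{code}
> Replacing event=cpu with event=wall or event=cycles gives the other two modes mentioned above.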
> h2. Observations
>  # Processor.run > Processor#updateRequestMetrics() is a very expensive call. We should revisit whether we want to pre-compute histograms or not. Maybe upgrading to the latest Dropwizard will improve this? (JIRA: https://issues.apache.org/jira/browse/KAFKA-14423)
>  # (Processor > Selector.poll), (Processor > Selector.write) and many other places - the cumulative cost of Sensor#recordInternal is high.
>  # Processor threads are consuming more CPU than handler threads?! Perhaps this is because handler threads spend a lot of time waiting for the partition lock in UnifiedLog.scala.
>  # HandleProduceRequest > RequestSizeInBytes - Unnecessary call to calculate the size in bytes here. A low-hanging opportunity to improve CPU utilisation for a request-heavy workload. (Fixed in https://issues.apache.org/jira/browse/KAFKA-14414)
>  # UnifiedLog#append > HeapByteBuffer.duplicate() - Why do we duplicate the buffer here? Ideally we shouldn't be making copies of the buffer during the produce workflow; we should be using the same buffer from reading off the socket to writing to the file.
>  # Processor > Selector.select - Why is epoll consuming CPU cycles? It should leave the thread in a timed_waiting state and hence shouldn't consume CPU at all.
>  # In a produce workload, writing to the socket is more CPU-intensive than reading from it. This is surprising because reads carry more data (the produce records) whereas writes only carry the response, which doesn't contain record data.
>  # RequestChannel#sendResponse > wakeup - This is the call that wakes up the selector by writing to a file descriptor. Why is it so expensive?
>  # The bottleneck in throughput is thread contention in UnifiedLog.scala, where handler threads spend 80-90% of their time waiting to acquire a lock (this was recorded from separate profiling). Can we remove this bottleneck by using separate fine-grained locks for the high-watermark update and the log append, since the current coarse-grained lock causes contention? (A rough sketch of the idea follows this list.)
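> A rough sketch of this fine-grained locking idea, purely illustrative (class, field, and method names are hypothetical) and not how UnifiedLog is structured today:
> {code:java}
> // Hypothetical sketch: split the single coarse lock into one lock for the
> // append path and a separate one for high-watermark maintenance, so that
> // appends and high-watermark updates no longer contend with each other.
> import java.util.concurrent.locks.ReentrantLock;
> 
> public class FineGrainedLogLocks {
>     private final ReentrantLock appendLock = new ReentrantLock();
>     private final ReentrantLock highWatermarkLock = new ReentrantLock();
>     private volatile long highWatermark = 0L;
> 
>     public void append(byte[] records) {
>         appendLock.lock();
>         try {
>             // append records to the active segment (omitted)
>         } finally {
>             appendLock.unlock();
>         }
>     }
> 
>     public void updateHighWatermark(long newOffset) {
>         highWatermarkLock.lock();
>         try {
>             if (newOffset > highWatermark) {
>                 highWatermark = newOffset;
>             }
>         } finally {
>             highWatermarkLock.unlock();
>         }
>     }
> }{code}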
> I am still analysing the flamegraph (the cpu-mode one is attached to this ticket). Please feel free to comment on any of the observations or add your own here.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)