Posted to users@kafka.apache.org by 张瑞 <zh...@fudan.edu.cn> on 2020/05/24 11:29:47 UTC

How should I measure the time spent in various stages of Kafka

Hi, we have some questions about Kafka and would like to ask you; we hope this email does not disturb you. We are doing some work based on Kafka, and we measured the time spent at each stage of Kafka's workflow when producing and consuming messages. The results show that network transmission time accounts for most of the RTT, up to more than 80%. We are puzzled by this result, so we would like to ask whether there are errors in our test method. Below we describe our experiment environment and test method, and we show the test results at the end.

Environment

We conduct experiments on two servers, one running the broker and the other running the producer or consumer, connected by a 100 Gbps switch. The Kafka version we use is the 2.4.0 release, which can be downloaded from https://github.com/apache/kafka/releases/tag/2.4.0.

Method

We want to measure the time Kafka spends in various stages when producing and consuming messages. The following are the changes we made to the source code.

Timestamps for producing messages

At the producer side, we record a nano timestamp in ./clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java, in the method private void sendProduceRequest, just before client.send(clientRequest, now);, to get the timestamp at which the client sends the produce request. We record another nano timestamp in the same method, just before the line handleProduceResponse(response, recordsByPartition, time.milliseconds());, to get the timestamp at which the client receives the response. We call the difference between these two timestamps the RTT.
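
To make the placement concrete, here is a minimal, self-contained Java sketch of this measurement; sendRequest() and handleResponse() are hypothetical stand-ins for client.send(...) and handleProduceResponse(...), not the actual Sender.java code:

    import java.util.concurrent.TimeUnit;

    // Minimal sketch of the producer-side measurement described above.
    // sendRequest() and handleResponse() are stand-ins, not actual Kafka code.
    public class ProduceRttSketch {
        static void sendRequest() { /* stands in for client.send(clientRequest, now) */ }
        static void handleResponse() { /* stands in for handleProduceResponse(...) */ }

        public static void main(String[] args) {
            long sendNanos = System.nanoTime();   // mark taken just before client.send(...)
            sendRequest();
            // ... the response arrives some time later ...
            long recvNanos = System.nanoTime();   // mark taken just before handleProduceResponse(...)
            handleResponse();
            System.out.println("RTT (producer side) = "
                    + TimeUnit.NANOSECONDS.toMicros(recvNanos - sendNanos) + " us");
        }
    }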

At the broker side, we record a nano timestamp in ./core/src/main/scala/kafka/server/KafkaRequestHandler.scala, in def run(), just before apis.handle(request), to get the timestamp at which the broker receives a request from a client. We record another timestamp in ./core/src/main/scala/kafka/network/RequestChannel.scala, in def sendResponse(response: RequestChannel.Response), just before val processor = processors.get(response.processor), to get the timestamp at which the broker sends the response to the client. We call the difference between these two timestamps the Processing time in broker.

In addition, we add timestamps before and after val localProduceResults = appendToLocalLog(...) in ./core/src/main/scala/kafka/server/ReplicaManager.scala, in def appendRecords(), to get the time spent writing to the local log. We call the difference between these two timestamps the Time for writing.
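
The broker-side changes follow the same pattern of taking two System.nanoTime() marks and subtracting them. The files above are Scala, so the following is only a Java-style sketch of that pattern, with handleRequest() and appendToLocalLog() as hypothetical stand-ins:

    import java.util.concurrent.TimeUnit;

    // Java-style sketch of the broker-side marks; the real code lives in Scala files.
    public class BrokerTimingSketch {
        static void handleRequest() {}     // stand-in for apis.handle(request)
        static void appendToLocalLog() {}  // stand-in for ReplicaManager.appendToLocalLog(...)

        public static void main(String[] args) {
            // "Processing time in broker": one mark before apis.handle(request)
            // (KafkaRequestHandler.run) and one when the response is handed over
            // in RequestChannel.sendResponse(...), here collapsed into one method.
            long handleStart = System.nanoTime();
            handleRequest();
            long responseSend = System.nanoTime();
            long processingUs = TimeUnit.NANOSECONDS.toMicros(responseSend - handleStart);

            // "Time for writing": marks directly around appendToLocalLog(...).
            long writeStart = System.nanoTime();
            appendToLocalLog();
            long writeUs = TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - writeStart);

            System.out.println("processing = " + processingUs + " us, writing = " + writeUs + " us");
        }
    }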

Timestamps for fetching messages

At the consumer side, we record a nano timestamp in ./clients/src/main/java/org/apache/kafka/clients/consumer/internals/Fetcher.java, in public synchronized int sendFetches(), just before RequestFuture<ClientResponse> future = client.send(fetchTarget, request);, to get the timestamp at which the client sends the fetch request. We record another nano timestamp just after the line public void onSuccess(ClientResponse resp) { in the same method, to get the timestamp at which the client receives the response. We call the difference between these two timestamps the RTT.
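
Again only as a rough illustration, here is a self-contained Java sketch of timing a request whose response is observed in a callback, as with onSuccess(ClientResponse); sendFetch() and the Runnable callback are hypothetical stand-ins for client.send(fetchTarget, request) and the RequestFuture listener:

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    // Sketch of the consumer-side measurement: first mark before the send,
    // second mark at the top of the success callback.
    public class FetchRttSketch {
        // Stand-in for the fetch request; invokes the callback as if the response arrived.
        static void sendFetch(Runnable onSuccess) { onSuccess.run(); }

        public static void main(String[] args) throws InterruptedException {
            AtomicLong rttNanos = new AtomicLong();
            CountDownLatch done = new CountDownLatch(1);

            long sendNanos = System.nanoTime();   // mark before client.send(fetchTarget, request)
            sendFetch(() -> {                     // callback stands in for onSuccess(ClientResponse resp)
                rttNanos.set(System.nanoTime() - sendNanos);  // mark inside the callback
                done.countDown();
            });

            done.await();
            System.out.println("RTT (consumer side) = "
                    + TimeUnit.NANOSECONDS.toMicros(rttNanos.get()) + " us");
        }
    }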

We also add timestamps before and after val logReadResults = readFromLog() in ./core/src/main/scala/kafka/server/ReplicaManager.scala, in fetchMessages, to get the time spent reading from the local log. We call the difference between these two timestamps the Time for reading.

Result

Here are the results for producing messages. Columns A and B are the independent variables we set. Columns C and D are measured at the broker, as mentioned in Timestamps for producing messages (column C is the Processing time in broker, column D is the Time for writing). Column E is column D divided by column C, i.e. the proportion of the broker's processing time spent writing to the local log. Column F is measured at the producer, as mentioned in Timestamps for producing messages, and gives the RTT from sending the request to receiving the response. Column G is column F minus column C, i.e. the time the data spends being sent and received on the network. Column H is column G divided by column F, i.e. the proportion of the RTT taken up by network transmission.
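
To state the arithmetic compactly, the derived columns are computed as in the small sketch below (the numbers are placeholders for illustration, not our measured data):

    // Illustrative computation of the derived columns; values are made up.
    public class DerivedColumns {
        public static void main(String[] args) {
            double c = 500;   // column C: Processing time in broker (us), placeholder
            double d = 200;   // column D: Time for writing (us), placeholder
            double f = 3000;  // column F: producer-side RTT (us), placeholder

            double e = d / c; // column E: share of broker processing spent writing
            double g = f - c; // column G: time attributed to the network
            double h = g / f; // column H: share of RTT attributed to the network

            System.out.printf("E = %.2f, G = %.0f us, H = %.2f%n", e, g, h);
        }
    }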

The recorded results for consuming messages are shown in the figure below. You can see that there is a negative number, which should not be possible; the other data looks normal.

Question

Why is the network transmission time so high? Is it a problem with the placement of our timestamps? Because of Kafka's NIO model, our measurement method may be inaccurate. How should we measure the time distribution across Kafka's stages, such as the proportion of processing time on the broker side versus the proportion of network transmission time? Could you give us some advice? We wish you happy work, and thank you!