Posted to jira@kafka.apache.org by "Haoze Wu (Jira)" <ji...@apache.org> on 2023/04/10 16:14:00 UTC

[jira] [Updated] (KAFKA-14886) Broker request handler thread pool is full due to single request slowdown

     [ https://issues.apache.org/jira/browse/KAFKA-14886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haoze Wu updated KAFKA-14886:
-----------------------------
    Description: 
In Kafka 2.8.0, we found that the pool of data-plane Kafka request handler threads can quickly reach its limit when only a single request is stuck. As a result, all other requests that need a data-plane request handler are also stuck.

When there is a slowdown inside the storeOffsets call at line 777 due to an I/O operation, the handler thread keeps holding the group lock acquired at line 754 for the entire duration of the stall:
{code:java}
  private def doCommitOffsets(group: GroupMetadata,
                              memberId: String,
                              groupInstanceId: Option[String],
                              generationId: Int,
                              offsetMetadata: immutable.Map[TopicPartition, OffsetAndMetadata],
                              responseCallback: immutable.Map[TopicPartition, Errors] => Unit): Unit = {
    group.inLock { // Line 754
      ..
      groupManager.storeOffsets(....) // Line 777
      ..
    }
  } {code}
Its call stack is:
{code:java}
kafka.coordinator.group.GroupMetadata,inLock,227
kafka.coordinator.group.GroupCoordinator,handleCommitOffsets,755
kafka.server.KafkaApis,handleOffsetCommitRequest,515
kafka.server.KafkaApis,handle,175
kafka.server.KafkaRequestHandler,run,74
java.lang.Thread,run,748 {code}
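Distilled to its essentials, the hazard looks like the minimal, self-contained sketch below (not Kafka code, and not the actual GroupCoordinator logic): a helper with the same shape as the inLock calls shown above runs a body while holding a lock, and a blocking I/O call inside that body keeps every other user of the same lock waiting. The slowIo stand-in and the 10-second stall are invented for illustration.
{code:scala}
import java.util.concurrent.locks.ReentrantLock

object LockHeldAcrossIo {
  // Same shape as the inLock calls shown above: run `body` while holding `lock`.
  def inLock[T](lock: ReentrantLock)(body: => T): T = {
    lock.lock()
    try body
    finally lock.unlock()
  }

  private val groupLock = new ReentrantLock() // stands in for the per-group lock

  // Stand-in for the storeOffsets write path: an I/O call that can stall.
  private def slowIo(): Unit = Thread.sleep(10000)

  // The commit path: the lock is held for the whole stall.
  def handleCommit(): Unit = inLock(groupLock) {
    slowIo()
  }

  // Any other request for the same group (a retry, a heartbeat) must wait
  // for handleCommit() to release the lock before it can do anything.
  def handleHeartbeat(): Unit = inLock(groupLock) {
    // trivial work, but it cannot start until the lock is free
  }
}
{code}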
The slowdown above occurs while the broker is handling an offset commit request from a consumer. When the consumer gets no response because of the slowdown, it automatically resends the request to the broker. Each consumer request is handled by a data-plane-kafka-request-handler thread, so another data-plane-kafka-request-handler thread also gets stuck at line 754 while handling the retried request, because it tries to acquire the very same consumer group lock. The retries repeat and none of them can succeed, so the pool of data-plane-kafka-request-handler threads fills up. Since this thread pool handles requests from all producers and consumers, all producers and consumers are affected.
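To make the exhaustion concrete, here is a hedged, self-contained sketch (not Kafka code) of a fixed-size handler pool being filled by requests that all block on one group lock. The pool size of 8 matches the broker default num.io.threads, but the executor, timings, and request counts are stand-ins for illustration only.
{code:scala}
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.locks.ReentrantLock

object HandlerPoolExhaustion extends App {
  val groupLock   = new ReentrantLock()             // stands in for the per-group lock
  val handlerPool = Executors.newFixedThreadPool(8) // stands in for the data-plane handler pool

  // One commit request holds the group lock while its I/O stalls.
  handlerPool.submit(new Runnable {
    override def run(): Unit = {
      groupLock.lock()
      try Thread.sleep(60000) // simulated slow storeOffsets I/O
      finally groupLock.unlock()
    }
  })

  // Retries and heartbeats for the same group each occupy a handler thread
  // and block waiting for the very same lock.
  for (_ <- 1 to 7) handlerPool.submit(new Runnable {
    override def run(): Unit = {
      groupLock.lock() // stuck here until the stalled commit releases the lock
      try {} finally groupLock.unlock()
    }
  })

  // A request for an unrelated group or topic now finds no free handler thread.
  handlerPool.submit(new Runnable {
    override def run(): Unit = println("unrelated request handled") // delayed until the stall ends
  })

  handlerPool.shutdown()
  handlerPool.awaitTermination(2, TimeUnit.MINUTES)
}
{code}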

The consumer's retry backoff mechanism might mitigate this issue by reducing the number of requests sent in a short period and thus freeing up slots in the thread pool. To test this, we increased the consumer config retry.backoff.ms from its 100ms default to 1000ms to see whether the issue disappears. However, the thread pool fills up again, because multiple heartbeat requests occupy its slots: all of them are stuck trying to acquire the same consumer group lock, which is still held by the thread at line 754 as described above. Specifically, the heartbeat handling is stuck at GroupCoordinator.handleHeartbeat@624:
{code:java}
  def handleHeartbeat(groupId: String,
                      memberId: String,
                      groupInstanceId: Option[String],
                      generationId: Int,
                      responseCallback: Errors => Unit): Unit = {
    ..
      case Some(group) => group.inLock { // Line 624
        ..
      }
    ..
  } {code}
Heartbeat requests are sent by the consumer every 3000ms by default (heartbeat.interval.ms) and have no backoff mechanism, so the data-plane-kafka-request-handler thread pool still fills up quickly.
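For reference, here is a minimal sketch of the consumer configuration used in the experiment described above, assuming placeholder bootstrap servers, group id, and topic; only retry.backoff.ms is changed from its default, and heartbeat.interval.ms is shown at its 3000ms default for context.
{code:scala}
import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer

object BackoffExperimentConsumer {
  val props = new Properties()
  props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // placeholder
  props.put(ConsumerConfig.GROUP_ID_CONFIG, "test-group")              // placeholder
  props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
  props.put(ConsumerConfig.RETRY_BACKOFF_MS_CONFIG, "1000")      // raised from the 100ms default
  props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000") // default; heartbeats have no backoff

  def main(args: Array[String]): Unit = {
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Collections.singletonList("test-topic")) // placeholder
    consumer.poll(Duration.ofSeconds(1)) // offset commits and heartbeats proceed as usual
    consumer.close()
  }
}
{code}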

Fix: 

Instead of waiting indefinitely for the lock, the handler could try to acquire it (possibly with a time limit). If acquisition fails, the request can be discarded so that other requests (including a later retry of the discarded one) can be processed. However, we feel this change would affect the semantics of many operations, so we would like to hear suggestions from the community.
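As a rough illustration of this direction (not a patch): a timed tryLock variant of the inLock helper could return None when the group lock cannot be acquired within a deadline, and the caller would then answer with a retriable error instead of parking a handler thread. The tryInLock and handleCommitWithTimeout names, the timeout value, and the Option return type are assumptions for the sketch only.
{code:scala}
import java.util.concurrent.TimeUnit
import java.util.concurrent.locks.ReentrantLock

object TryLockSketch {
  // Hypothetical timed variant of the inLock helper: give up instead of waiting forever.
  def tryInLock[T](lock: ReentrantLock, timeoutMs: Long)(body: => T): Option[T] = {
    if (lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS)) {
      try Some(body)
      finally lock.unlock()
    } else {
      None // the caller would map this to a retriable error and free the handler thread
    }
  }

  // Example shape of a caller (names and error handling are made up):
  def handleCommitWithTimeout(groupLock: ReentrantLock)(commit: => Unit): Boolean =
    tryInLock(groupLock, timeoutMs = 300)(commit).isDefined
}
{code}
The open question from the description remains: every call site that currently assumes inLock always succeeds would need a well-defined error path, which is why suggestions from the community are requested.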


> Broker request handler thread pool is full due to single request slowdown
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-14886
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14886
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 2.8.0
>            Reporter: Haoze Wu
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)