You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by "Cheng Tan (Jira)" <ji...@apache.org> on 2020/06/02 07:54:00 UTC

[jira] [Comment Edited] (KAFKA-9800) [KIP-580] Client Exponential Backoff Implementation

    [ https://issues.apache.org/jira/browse/KAFKA-9800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123377#comment-17123377 ] 

Cheng Tan edited comment on KAFKA-9800 at 6/2/20, 7:53 AM:
-----------------------------------------------------------

Recap the discussion in Github. We want to implement a per-request backoff for all types of clients.

 

Let me talk about two of my major concerns and thoughts about implementing the universal client exponential backoff.

 

*AdminClient logic redundant*

NetworkClient has request timeout handlers. Producer / Consumer are using NetworkClient to help handle timeout but AdminClient doesn’t. The reason, to my understanding, is that AdminClient is implementing the per-request timeout.

For example,
 # Wrapping the request builder into a new class {{Call}}, (the construction lambda adds tons of lines into the AdminClient.java, which should probably have been living in each AbstractRequest implementation classes files)
 # Re-writing the request queues for different request status, while normal clients are fully using the NetworkClient.

After we add support to the per-request timeout to all clients, the AdminClient per-request timeout demand won’t be special anymore. Thus, the code for supporting the per-request timeout in AdminClient is not useful anymore and might be removed.

Are we considering refactoring the AdminClient further and remove all the redundant logic which should have belonged to the networking layer and the AbstractRequest implementation classes?

*Flexible backoff modes*

Let's analyze the request backoff demands of all the types of clients at this point. In my opinion, there are simply two:
 # Requests do not need exponential backoff. These requests need to be sent ASAP to avoid dataflow performance degradation, such as the {{ProduceRequest}} and its related/preceding metadata requests.

 # Request do need exponential backoff. These requests are “second-class citizens” and can be throttled to avoid request storms on the broker side. Such as metadata related requests in AdminClient.

Now the question comes. Even when two requests are of the same request type, one may have to get sent ASAP while the other one may wait, depending on the use case. We need to think deeper about how to make a classification.

But the implementation would be simple. We can utilize the existing builder pattern AbstractRequest and build the request flexibly upon a given retry_backoff mode. For example,
 # AbstractRequest.Builder will interact with a new abstract class specifying the retry_backoff option, static or exponential.
 # AbstractRequest will have some new interfaces controlling the backoff.

Then, we can control if the request should have a static backoff or an exponential backoff when we construct each implementation instance of AbstractRequest.Builder.

I'll include more details in the Jira ticket and rewrite this PR. Before we talk more about the code details and start the new implementation, please let me know what you think about the AdminClient refactor and static/exponential retry_backoff classification rule.


was (Author: d8tltanc):
Recap the discussion in Github. We want to implement a per-request backoff for all types of clients.

 

Let me talk about two of my major concerns and thoughts about implementing the universal client exponential backoff.

*AdminClient logic redundant*

The main request flow difference btw AdminClient and normal clients (e.g. Producer and Consumer) would be that AdminClient wants to have a per request timeout while normal clients is okay with a static default timeout. Thus, AdminClient rewrote a quite amount of NetworkClient's functionality.

For example,
 # Wrapping the request builder into a new class {{Call}}, (the construction lambda adds tons of lines into the AdminClient.java, which should probably have been living in each AbstractRequest implementation classes files)
 # Re-writing the request queues for different request status, while normal clients are fully using the NetworkClient.

These logics will become redundant after we support exponential backoff in NetworkClient for all types of clients. Are we considering refactoring the AdminClient further and remove all the redundant logic which should have belonged to the networking layer and the AbstractRequest implementation classes?

*Flexible backoff modes*

Let's analyze the request backoff demands of all the types of clients at this point. In my opinion, there are simply two:
 # Requests do not need exponential backoff. These requests need to be sent ASAP to avoid dataflow performance degradation, such as the {{ProduceRequest}} and its related/preceding metadata requests.

 # Request do need exponential backoff. These requests are “second-class citizens” and can be throttled to avoid request storms on the broker side. Such as metadata related requests in AdminClient.

Now the question comes. Even when two requests are of the same request type, one may have to get sent ASAP while the other one may wait, depending on the use case. We need to think deeper about how to make a classification.

But the implementation would be simple. We can utilize the existing builder pattern AbstractRequest and build the request flexibly upon a given retry_backoff mode. For example,
 # AbstractRequest.Builder will interact with a new abstract class specifying the retry_backoff option, static or exponential.
 # AbstractRequest will have some new interfaces controlling the backoff.

Then, we can control if the request should have a static backoff or an exponential backoff when we construct each implementation instance of AbstractRequest.Builder.

I'll include more details in the Jira ticket and rewrite this PR. Before we talk more about the code details and start the new implementation, please let me know what you think about the AdminClient refactor and static/exponential retry_backoff classification rule.

 

> [KIP-580] Client Exponential Backoff Implementation
> ---------------------------------------------------
>
>                 Key: KAFKA-9800
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9800
>             Project: Kafka
>          Issue Type: New Feature
>            Reporter: Cheng Tan
>            Assignee: Cheng Tan
>            Priority: Major
>              Labels: KIP-580
>
> In {{KafkaAdminClient}}, we will have to modify the way the retry backoff is calculated for the calls that have failed and need to be retried. From the current static retry backoff, we have to introduce a mechanism for all calls that upon failure, the next retry time is dynamically calculated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)