You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by "Kirk True (Jira)" <ji...@apache.org> on 2023/02/07 15:36:00 UTC

[jira] [Commented] (KAFKA-14317) ProduceRequest timeouts are logged as network exceptions

    [ https://issues.apache.org/jira/browse/KAFKA-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685353#comment-17685353 ] 

Kirk True commented on KAFKA-14317:
-----------------------------------

[~dajac] could you take another look at the PR? Thanks!

> ProduceRequest timeouts are logged as network exceptions
> --------------------------------------------------------
>
>                 Key: KAFKA-14317
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14317
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, logging, producer 
>    Affects Versions: 3.3.0
>            Reporter: Kirk True
>            Assignee: Kirk True
>            Priority: Major
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> In {{{}NetworkClient.handleTimedOutRequests{}}}, we disconnect the broker connection:
> {code:java}
> private void handleTimedOutRequests(List<ClientResponse> responses, long now) {
>     List<String> nodeIds = this.inFlightRequests.nodesWithTimedOutRequests(now);
>     for (String nodeId : nodeIds) {
>         // close connection to the node
>         this.selector.close(nodeId);
>         log.info("Disconnecting from node {} due to request timeout.", nodeId);
>         processDisconnection(responses, nodeId, now, ChannelState.LOCAL_CLOSE);
>     }
> }
> {code}
> This calls {{processDisconnection}} which calls {{{}cancelInFlightRequests{}}}:
> {code:java}
> for (InFlightRequest request : inFlightRequests) {
>     if (log.isDebugEnabled()) {
>         log.debug("Cancelled in-flight {} request with correlation id {} due to node {} being disconnected " +
>                 "(elapsed time since creation: {}ms, elapsed time since send: {}ms, request timeout: {}ms): {}",
>             request.header.apiKey(), request.header.correlationId(), nodeId,
>             request.timeElapsedSinceCreateMs(now), request.timeElapsedSinceSendMs(now),
>             request.requestTimeoutMs, request.request);
>     } else {
>         log.info("Cancelled in-flight {} request with correlation id {} due to node {} being disconnected " +
>                 "(elapsed time since creation: {}ms, elapsed time since send: {}ms, request timeout: {}ms)",
>             request.header.apiKey(), request.header.correlationId(), nodeId,
>             request.timeElapsedSinceCreateMs(now), request.timeElapsedSinceSendMs(now),
>             request.requestTimeoutMs);
>     }
>     if (!request.isInternalRequest) {
>         if (responses != null)
>             responses.add(request.disconnected(now, null));
>     } else if (request.header.apiKey() == ApiKeys.METADATA) {
>         metadataUpdater.handleFailedRequest(now, Optional.empty());
>     }
> }
> {code}
> We create a new {{ClientResponse}} in which the {{disconnected}} flag is set.
> We then complete the record batch In {{Sender.handleProduceResponse}} with:
> {code:java}
> if (response.wasDisconnected()) {
>     log.trace("Cancelled request with header {} due to node {} being disconnected",
>     requestHeader, response.destination());
>     for (ProducerBatch batch : batches.values())
>         completeBatch(batch, new ProduceResponse.PartitionResponse(Errors.NETWORK_EXCEPTION, String.format("Disconnected from node %s", response.destination())),
>     correlationId, now);
> }
> {code}
> This seems like it could be confusing for customers that they would see network exceptions on a request timeout instead of a timeout error.
> One implication of completing the batch with a network exception is that the producer will try to refresh metadata after a request timeout. I can see arguments for why this is necessary.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)