You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by "Kirk True (Jira)" <ji...@apache.org> on 2023/02/07 15:36:00 UTC
[jira] [Commented] (KAFKA-14317) ProduceRequest timeouts are logged as network exceptions
[ https://issues.apache.org/jira/browse/KAFKA-14317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17685353#comment-17685353 ]
Kirk True commented on KAFKA-14317:
-----------------------------------
[~dajac] could you take another look at the PR? Thanks!
> ProduceRequest timeouts are logged as network exceptions
> --------------------------------------------------------
>
> Key: KAFKA-14317
> URL: https://issues.apache.org/jira/browse/KAFKA-14317
> Project: Kafka
> Issue Type: Bug
> Components: clients, logging, producer
> Affects Versions: 3.3.0
> Reporter: Kirk True
> Assignee: Kirk True
> Priority: Major
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> In {{{}NetworkClient.handleTimedOutRequests{}}}, we disconnect the broker connection:
> {code:java}
> private void handleTimedOutRequests(List<ClientResponse> responses, long now) {
> List<String> nodeIds = this.inFlightRequests.nodesWithTimedOutRequests(now);
> for (String nodeId : nodeIds) {
> // close connection to the node
> this.selector.close(nodeId);
> log.info("Disconnecting from node {} due to request timeout.", nodeId);
> processDisconnection(responses, nodeId, now, ChannelState.LOCAL_CLOSE);
> }
> }
> {code}
> This calls {{processDisconnection}} which calls {{{}cancelInFlightRequests{}}}:
> {code:java}
> for (InFlightRequest request : inFlightRequests) {
> if (log.isDebugEnabled()) {
> log.debug("Cancelled in-flight {} request with correlation id {} due to node {} being disconnected " +
> "(elapsed time since creation: {}ms, elapsed time since send: {}ms, request timeout: {}ms): {}",
> request.header.apiKey(), request.header.correlationId(), nodeId,
> request.timeElapsedSinceCreateMs(now), request.timeElapsedSinceSendMs(now),
> request.requestTimeoutMs, request.request);
> } else {
> log.info("Cancelled in-flight {} request with correlation id {} due to node {} being disconnected " +
> "(elapsed time since creation: {}ms, elapsed time since send: {}ms, request timeout: {}ms)",
> request.header.apiKey(), request.header.correlationId(), nodeId,
> request.timeElapsedSinceCreateMs(now), request.timeElapsedSinceSendMs(now),
> request.requestTimeoutMs);
> }
> if (!request.isInternalRequest) {
> if (responses != null)
> responses.add(request.disconnected(now, null));
> } else if (request.header.apiKey() == ApiKeys.METADATA) {
> metadataUpdater.handleFailedRequest(now, Optional.empty());
> }
> }
> {code}
> We create a new {{ClientResponse}} in which the {{disconnected}} flag is set.
> We then complete the record batch In {{Sender.handleProduceResponse}} with:
> {code:java}
> if (response.wasDisconnected()) {
> log.trace("Cancelled request with header {} due to node {} being disconnected",
> requestHeader, response.destination());
> for (ProducerBatch batch : batches.values())
> completeBatch(batch, new ProduceResponse.PartitionResponse(Errors.NETWORK_EXCEPTION, String.format("Disconnected from node %s", response.destination())),
> correlationId, now);
> }
> {code}
> This seems like it could be confusing for customers that they would see network exceptions on a request timeout instead of a timeout error.
> One implication of completing the batch with a network exception is that the producer will try to refresh metadata after a request timeout. I can see arguments for why this is necessary.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)