Posted to jira@kafka.apache.org by "Chia-Ping Tsai (Jira)" <ji...@apache.org> on 2020/07/02 21:07:00 UTC

[jira] [Commented] (KAFKA-10228) producer: NETWORK_EXCEPTION is thrown instead of a request timeout

    [ https://issues.apache.org/jira/browse/KAFKA-10228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150583#comment-17150583 ] 

Chia-Ping Tsai commented on KAFKA-10228:
----------------------------------------

It seems the request timeout is handled locally by the client, and the resulting error is always set to Errors.NETWORK_EXCEPTION.

{code}
    private void handleTimedOutRequests(List<ClientResponse> responses, long now) {
        List<String> nodeIds = this.inFlightRequests.nodesWithTimedOutRequests(now);
        for (String nodeId : nodeIds) {
            // close connection to the node
            this.selector.close(nodeId);
            log.debug("Disconnecting from node {} due to request timeout.", nodeId);
            processDisconnection(responses, nodeId, now, ChannelState.LOCAL_CLOSE);
        }
    }
{code}


{code}
        if (response.wasDisconnected()) {
            log.trace("Cancelled request with header {} due to node {} being disconnected",
                requestHeader, response.destination());
            for (ProducerBatch batch : batches.values())
                completeBatch(batch, new ProduceResponse.PartitionResponse(Errors.NETWORK_EXCEPTION), correlationId, now);
{code} 

Perhaps we can add a new flag, similar to "disconnected", to indicate that the disconnection was caused by a local timeout.
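
A rough, self-contained sketch of that idea follows. It is not Kafka code: SketchResponse, SketchError, wasTimedOut() and errorFor() are hypothetical names made up for illustration, and the real change would have to touch the client's response and disconnection handling. The point is only that a timeout flag, checked before the generic disconnect flag, lets the sender report a timeout error instead of NETWORK_EXCEPTION.

```java
/** Hypothetical, simplified stand-in for a client response; NOT Kafka's ClientResponse. */
final class SketchResponse {
    private final boolean disconnected;
    // Hypothetical flag: the client itself closed the connection after a local request timeout.
    private final boolean timedOut;

    SketchResponse(boolean disconnected, boolean timedOut) {
        this.disconnected = disconnected;
        this.timedOut = timedOut;
    }

    boolean wasDisconnected() { return disconnected; }

    boolean wasTimedOut() { return timedOut; }
}

/** Hypothetical error codes for this sketch; they only mirror names from Kafka's Errors enum. */
enum SketchError { NONE, NETWORK_EXCEPTION, REQUEST_TIMED_OUT }

final class TimeoutFlagSketch {
    /**
     * Chooses the error to complete a batch with. A local timeout also closes the
     * connection, so the timeout flag must be checked before the disconnect flag.
     */
    static SketchError errorFor(SketchResponse response) {
        if (response.wasTimedOut())
            return SketchError.REQUEST_TIMED_OUT;
        if (response.wasDisconnected())
            return SketchError.NETWORK_EXCEPTION;
        return SketchError.NONE;
    }

    public static void main(String[] args) {
        // Connection closed by the client because the request timed out locally.
        System.out.println(errorFor(new SketchResponse(true, true)));
        // Connection lost for any other reason.
        System.out.println(errorFor(new SketchResponse(true, false)));
    }
}
```

With such a flag, the retry log quoted in the issue would name a timeout rather than a network error, which is what the reporter asks for.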

> producer: NETWORK_EXCEPTION is thrown instead of a request timeout
> ------------------------------------------------------------------
>
>                 Key: KAFKA-10228
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10228
>             Project: Kafka
>          Issue Type: Improvement
>          Components: clients
>    Affects Versions: 2.3.1
>            Reporter: Christian Becker
>            Priority: Major
>
> We're currently seeing an issue with the java client (producer), when message producing runs into a timeout. Namely a NETWORK_EXCEPTION is thrown instead of a timeout exception.
> *Situation and relevant code:*
> Config
> {code:java}
> request.timeout.ms: 200
> retries: 3
> acks: all{code}
> {code:java}
> for (UnpublishedEvent event : unpublishedEvents) {
>     ListenableFuture<SendResult<String, String>> future;
>     future = kafkaTemplate.send(new ProducerRecord<>(event.getTopic(), event.getKafkaKey(), event.getPayload()));
>     futures.add(future.completable());
> }
> CompletableFuture.allOf(futures.stream().toArray(CompletableFuture[]::new)).join();{code}
> We're using the KafkaTemplate from SpringBoot here, but it shouldn't matter, as it's merely a wrapper. There we put in batches of messages to be sent.
> 200ms later, we can see the following in the logs: (not sure about the order, they've arrived in the same ms, so our logging system might not display them in the right order)
> {code:java}
> [Producer clientId=producer-1] Received invalid metadata error in produce request on partition events-6 due to org.apache.kafka.common.errors.NetworkException: The server disconnected before a response was received.. Going to request metadata update now
> [Producer clientId=producer-1] Got error produce response with correlation id 3094 on topic-partition events-6, retrying (2 attempts left). Error: NETWORK_EXCEPTION {code}
> There is also a corresponding error on the broker (within a few ms):
> {code:java}
> Attempting to send response via channel for which there is no open connection, connection id XXX (kafka.network.Processor) {code}
> This was somewhat unexpected and sent us for a hunt across the infrastructure for possible connection issues, but we've found none.
> Side note: In some cases the retries worked and the messages were successfully produced.
> Only after many hours of heavy debugging, we've noticed, that the error might be related to the low timeout setting. We've removed that setting now, as it was a remnant from the past and no longer valid for our use-case. However in order to avoid other people having that issue again and to simplify future debugging, some form of timeout exception should be thrown.


