You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by "Dmitry (Jira)" <ji...@apache.org> on 2020/05/18 07:20:00 UTC

[jira] [Created] (KAFKA-10013) Consumer hang-up in case of unclean leader election

Dmitry created KAFKA-10013:
------------------------------

             Summary: Consumer hang-up in case of unclean leader election
                 Key: KAFKA-10013
                 URL: https://issues.apache.org/jira/browse/KAFKA-10013
             Project: Kafka
          Issue Type: Bug
          Components: consumer
    Affects Versions: 2.4.1, 2.5.0, 2.3.1, 2.4.0, 2.3.0
            Reporter: Dmitry


Starting from kafka 2.3 new offset reset negotiation algorithm added (org.apache.kafka.clients.consumer.internals.Fetcher#validateOffsetsAsync)

During this validation, Fetcher `org.apache.kafka.clients.consumer.internals.SubscriptionState` is held in `AWAIT_VALIDATION` fetch state.

This effectively means that fetch requests are not issued and consumption stopped.
In case if unclean leader election is happening during this time, `LogTruncationException` is thrown from future listener in method `validateOffsetsAsync` (probably in order to turn on the logic defined by `auto.offset.reset` parameter).

The main problem is that this exception (thrown from listener of future) is effectively swallowed by `org.apache.kafka.clients.consumer.internals.AsyncClient#sendAsyncRequest`
by this part of code


} catch (RuntimeException e) {
  if (!future.isDone()) {
    future.raise(e);
  }
}

In the end the result is: The only way to get out of AWAIT_VALIDATION and continue consumption is to successfully finish validation, but it can not be finished.
However - consumer is alive, but is consuming nothing. The only way to resume consumption is to terminate consumer and start another one.

We discovered this situation by means of kstreams application, where valid value of `auto.offset.reset` provided by our code is replaced by `None` value for a purpose of position reset (org.apache.kafka.streams.processor.internals.StreamThread#create).
And with kstreams it is even worse, as application may be working, logging warn messages of format `Truncation detected for partition ...,` but data is not generated for a long time and in the end is lost, making kstreams application unreliable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)