Posted to jira@kafka.apache.org by "Nanda Kishore M S (Jira)" <ji...@apache.org> on 2021/07/07 13:27:00 UTC

[jira] [Created] (KAFKA-13044) __consumer_offsets corruption

Nanda Kishore M S created KAFKA-13044:
-----------------------------------------

             Summary: __consumer_offsets corruption
                 Key: KAFKA-13044
                 URL: https://issues.apache.org/jira/browse/KAFKA-13044
             Project: Kafka
          Issue Type: Bug
          Components: offset manager
    Affects Versions: 2.5.0
         Environment: Amazon Linux
Kafka Server: 2.5.0, scala version - 2.12
            Reporter: Nanda Kishore M S


We had an issue where clients were unable to use their group coordinator: discovery succeeded, but the coordinator was immediately marked unavailable and rediscovered in a loop. We saw the following in the client logs when we tried to read data from a topic via kafka-console-consumer:

 

{{[2021-07-06 08:15:14,499] DEBUG [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Sending FindCoordinator request to broker kafka01-broker:9094 (id: 5 rack: us-west-2b) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2021-07-06 08:15:14,504] DEBUG [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Received FindCoordinator response ClientResponse(receivedTimeMs=1625559314504, latencyMs=5, disconnected=false, requestHeader=RequestHeader(apiKey=FIND_COORDINATOR, apiVersion=3, clientId=consumer-test-consumer-group-1-1, correlationId=32), responseBody=FindCoordinatorResponseData(throttleTimeMs=0, errorCode=0, errorMessage='NONE', nodeId=3, host='kafka02-broker', port=9094)) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2021-07-06 08:15:14,504] INFO [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Discovered group coordinator kafka02-broker:9094 (id: 2147483644 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2021-07-06 08:15:14,504] INFO [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Group coordinator kafka02-broker:9094 (id: 2147483644 rack: null) is unavailable or invalid, will attempt rediscovery (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)}}

{{[2021-07-06 08:15:14,504] DEBUG [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Sending FindCoordinator request to broker kafka01-broker:9094 (id: 5 rack: us-west-2b) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2021-07-06 08:15:14,507] DEBUG [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Received FindCoordinator response ClientResponse(receivedTimeMs=1625559314504, latencyMs=5, disconnected=false, requestHeader=RequestHeader(apiKey=FIND_COORDINATOR, apiVersion=3, clientId=consumer-test-consumer-group-1-1, correlationId=32), responseBody=FindCoordinatorResponseData(throttleTimeMs=0, errorCode=0, errorMessage='NONE', nodeId=3, host='kafka02-broker', port=9094)) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2021-07-06 08:15:14,507] INFO [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Discovered group coordinator kafka02-broker:9094 (id: 2147483644 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2021-07-06 08:15:14,507] INFO [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Group coordinator kafka02-broker:9094 (id: 2147483644 rack: null) is unavailable or invalid, will attempt rediscovery (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)}}

and so on.
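
A note on the ids in these logs: as far as we understand the client, it registers the coordinator under a separate connection id computed as Integer.MAX_VALUE minus the broker id, so id 2147483644 corresponds to nodeId=3 from the FindCoordinator response and is not itself a sign of corruption. The broker side picks the coordinator as the leader of the __consumer_offsets partition that the group id hashes to. The sketch below is our reading of that logic, not code copied from Kafka; the class and method names are hypothetical.

```java
public class CoordinatorLookup {

    // Assumed mapping: non-negative hash of the group id, modulo the
    // partition count of __consumer_offsets (50 in this cluster).
    static int offsetsPartitionFor(String groupId, int partitionCount) {
        int h = groupId.hashCode();
        // Math.abs(Integer.MIN_VALUE) is still negative, so map it to 0.
        int nonNegative = (h == Integer.MIN_VALUE) ? 0 : Math.abs(h);
        return nonNegative % partitionCount;
    }

    // Connection id the consumer uses for the coordinator of a given broker.
    static int coordinatorConnectionId(int brokerId) {
        return Integer.MAX_VALUE - brokerId;
    }

    public static void main(String[] args) {
        // Broker 3 was returned as coordinator in the logs above.
        System.out.println(coordinatorConnectionId(3)); // prints 2147483644

        // Which __consumer_offsets partition the test group maps to.
        System.out.println(offsetsPartitionFor("test-consumer-group-1", 50));
    }
}
```

Comparing the printed partition number against the describe output shows whether the group lands on one of the affected partitions.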

We had a look at the __consumer_offsets topic, and the data looks wrong for 7 partitions (1, 4, 18, 21, 31, 34 and 48), where the ISR set and the replica set are disjoint:

 

./kafka-topics.sh --describe --zookeeper localhost:2181 --topic __consumer_offsets
Topic: __consumer_offsets PartitionCount: 50 ReplicationFactor: 3 Configs: compression.type=producer,cleanup.policy=compact,segment.bytes=104857600
 Topic: __consumer_offsets Partition: 0 Leader: 5 Replicas: 5,3,4 Isr: 4,5,3
 *Topic: __consumer_offsets Partition: 1 Leader: 3 Replicas: 6,4,5 Isr: 3,2*
 Topic: __consumer_offsets Partition: 2 Leader: 1 Replicas: 1,5,6 Isr: 1,5,6
 Topic: __consumer_offsets Partition: 3 Leader: 2 Replicas: 2,6,1 Isr: 1,2,6
 *Topic: __consumer_offsets Partition: 4 Leader: 6 Replicas: 3,1,2 Isr: 6,5*
 Topic: __consumer_offsets Partition: 5 Leader: 4 Replicas: 4,2,3 Isr: 4,2,3
 Topic: __consumer_offsets Partition: 6 Leader: 5 Replicas: 5,6,1 Isr: 1,5,6
 Topic: __consumer_offsets Partition: 7 Leader: 6 Replicas: 6,1,2 Isr: 1,2,6
 Topic: __consumer_offsets Partition: 8 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
 Topic: __consumer_offsets Partition: 9 Leader: 2 Replicas: 2,3,4 Isr: 4,2,3
 Topic: __consumer_offsets Partition: 10 Leader: 3 Replicas: 3,4,5 Isr: 4,5,3
 Topic: __consumer_offsets Partition: 11 Leader: 4 Replicas: 4,5,6 Isr: 4,5,6
 Topic: __consumer_offsets Partition: 12 Leader: 5 Replicas: 5,3,4 Isr: 4,5,3
 Topic: __consumer_offsets Partition: 13 Leader: 6 Replicas: 6,4,5 Isr: 4,5,6
 Topic: __consumer_offsets Partition: 14 Leader: 1 Replicas: 1,5,6 Isr: 1,5,6
 Topic: __consumer_offsets Partition: 15 Leader: 2 Replicas: 2,6,1 Isr: 1,2,6
 Topic: __consumer_offsets Partition: 16 Leader: 3 Replicas: 3,1,2 Isr: 1,2,3
 Topic: __consumer_offsets Partition: 17 Leader: 4 Replicas: 4,2,3 Isr: 4,2,3
 *Topic: __consumer_offsets Partition: 18 Leader: 2 Replicas: 5,1,3 Isr: 2,6*
 Topic: __consumer_offsets Partition: 19 Leader: 6 Replicas: 6,2,4 Isr: 4,2,6
 Topic: __consumer_offsets Partition: 20 Leader: 1 Replicas: 1,3,5 Isr: 1,5,3
 *Topic: __consumer_offsets Partition: 21 Leader: 5 Replicas: 2,4,6 Isr: 5,3*
 Topic: __consumer_offsets Partition: 22 Leader: 3 Replicas: 3,5,1 Isr: 1,5,3
 Topic: __consumer_offsets Partition: 23 Leader: 4 Replicas: 4,6,2 Isr: 4,2,6
 Topic: __consumer_offsets Partition: 24 Leader: 5 Replicas: 5,4,6 Isr: 4,5,6
 Topic: __consumer_offsets Partition: 25 Leader: 6 Replicas: 6,5,1 Isr: 1,5,6
 Topic: __consumer_offsets Partition: 26 Leader: 1 Replicas: 1,6,2 Isr: 1,2,6
 Topic: __consumer_offsets Partition: 27 Leader: 2 Replicas: 2,1,3 Isr: 1,2,3
 Topic: __consumer_offsets Partition: 28 Leader: 3 Replicas: 3,2,4 Isr: 4,2,3
 Topic: __consumer_offsets Partition: 29 Leader: 4 Replicas: 4,3,5 Isr: 4,5,3
 Topic: __consumer_offsets Partition: 30 Leader: 5 Replicas: 5,3,4 Isr: 4,5,3
 *Topic: __consumer_offsets Partition: 31 Leader: 3 Replicas: 6,4,5 Isr: 3,2*
 Topic: __consumer_offsets Partition: 32 Leader: 1 Replicas: 1,5,6 Isr: 1,5,6
 Topic: __consumer_offsets Partition: 33 Leader: 2 Replicas: 2,6,1 Isr: 1,2,6
 *Topic: __consumer_offsets Partition: 34 Leader: 6 Replicas: 3,1,2 Isr: 6,5*
 Topic: __consumer_offsets Partition: 35 Leader: 4 Replicas: 4,2,3 Isr: 4,2,3
 Topic: __consumer_offsets Partition: 36 Leader: 5 Replicas: 5,6,1 Isr: 1,5,6
 Topic: __consumer_offsets Partition: 37 Leader: 6 Replicas: 6,1,2 Isr: 1,2,6
 Topic: __consumer_offsets Partition: 38 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
 Topic: __consumer_offsets Partition: 39 Leader: 2 Replicas: 2,3,4 Isr: 4,2,3
 Topic: __consumer_offsets Partition: 40 Leader: 3 Replicas: 3,4,5 Isr: 4,5,3
 Topic: __consumer_offsets Partition: 41 Leader: 4 Replicas: 4,5,6 Isr: 4,5,6
 Topic: __consumer_offsets Partition: 42 Leader: 5 Replicas: 5,3,4 Isr: 4,5,3
 Topic: __consumer_offsets Partition: 43 Leader: 6 Replicas: 6,4,5 Isr: 4,5,6
 Topic: __consumer_offsets Partition: 44 Leader: 1 Replicas: 1,5,6 Isr: 1,5,6
 Topic: __consumer_offsets Partition: 45 Leader: 2 Replicas: 2,6,1 Isr: 1,2,6
 Topic: __consumer_offsets Partition: 46 Leader: 3 Replicas: 3,1,2 Isr: 1,2,3
 Topic: __consumer_offsets Partition: 47 Leader: 4 Replicas: 4,2,3 Isr: 4,2,3
 *Topic: __consumer_offsets Partition: 48 Leader: 2 Replicas: 5,1,3 Isr: 2,6*
 Topic: __consumer_offsets Partition: 49 Leader: 6 Replicas: 6,2,4 Isr: 4,2,6
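
To avoid spotting these rows by eye, the describe output can be checked mechanically. The following is a minimal sketch, assuming the line format shown above; the class name IsrCheck and its methods are hypothetical helpers and not part of any Kafka tooling.

```java
import java.util.*;
import java.util.regex.*;

public class IsrCheck {

    // Matches the per-partition rows of `kafka-topics.sh --describe`.
    // The summary row ("PartitionCount: ...") does not match.
    private static final Pattern ROW = Pattern.compile(
        "Partition:\\s*(\\d+)\\s+Leader:\\s*(-?\\d+)\\s+Replicas:\\s*([\\d,]+)\\s+Isr:\\s*([\\d,]*)");

    // Parses a comma-separated broker id list such as "6,4,5".
    static Set<Integer> ids(String csv) {
        Set<Integer> out = new TreeSet<>();
        for (String s : csv.split(",")) {
            if (!s.isEmpty()) {
                out.add(Integer.parseInt(s.trim()));
            }
        }
        return out;
    }

    // Returns the partitions whose ISR shares no broker with the replica list.
    static List<Integer> disjointPartitions(List<String> describeLines) {
        List<Integer> bad = new ArrayList<>();
        for (String line : describeLines) {
            Matcher m = ROW.matcher(line);
            if (!m.find()) {
                continue;
            }
            Set<Integer> replicas = ids(m.group(3));
            Set<Integer> isr = ids(m.group(4));
            if (!isr.isEmpty() && Collections.disjoint(replicas, isr)) {
                bad.add(Integer.parseInt(m.group(1)));
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "Topic: __consumer_offsets PartitionCount: 50 ReplicationFactor: 3",
            " Topic: __consumer_offsets Partition: 0 Leader: 5 Replicas: 5,3,4 Isr: 4,5,3",
            " Topic: __consumer_offsets Partition: 1 Leader: 3 Replicas: 6,4,5 Isr: 3,2",
            " Topic: __consumer_offsets Partition: 34 Leader: 6 Replicas: 3,1,2 Isr: 6,5",
            " Topic: __consumer_offsets Partition: 38 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3");
        System.out.println(disjointPartitions(sample)); // prints [1, 34]
    }
}
```

Run over the full output above, a check like this should report the partitions whose ISR lies entirely outside the assigned replicas.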
 

 

Looking at the source code, in the class {{AbstractCoordinator.java}}, {{client.isUnavailable(coordinator)}} seems to return true, hence the endless loop:
{code:java}
protected synchronized boolean ensureCoordinatorReady(final Timer timer) {
 ...

 } else if (coordinator != null && client.isUnavailable(coordinator)) {
    // we found the coordinator, but the connection has failed, so mark
    // it dead and backoff before retrying discovery    
    markCoordinatorUnknown();
    timer.sleep(rebalanceConfig.retryBackoffMs);
 }
{code}
 

We were able to work around this by running kafka-reassign-partitions.sh on the affected partitions, replacing their replica values with their ISR values.
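
For reference, the reassignment file for this workaround can be generated rather than written by hand. This is only a sketch of what we did, assuming the standard version-1 JSON layout that kafka-reassign-partitions.sh reads; the class ReassignmentJson is a hypothetical helper, and the ISR values are the ones from the describe output above.

```java
import java.util.*;

public class ReassignmentJson {

    // Builds a reassignment document that sets each partition's replica
    // list to the given broker ids (here: the ISR observed for it).
    static String build(String topic, Map<Integer, List<Integer>> replicasByPartition) {
        StringBuilder sb = new StringBuilder("{\"version\":1,\"partitions\":[");
        boolean first = true;
        for (Map.Entry<Integer, List<Integer>> e : replicasByPartition.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            // List.toString() yields "[3, 2]"; drop the spaces for compact JSON.
            String replicas = e.getValue().toString().replace(" ", "");
            sb.append("{\"topic\":\"").append(topic)
              .append("\",\"partition\":").append(e.getKey())
              .append(",\"replicas\":").append(replicas)
              .append('}');
        }
        return sb.append("]}").toString();
    }

    public static void main(String[] args) {
        // Partition -> ISR, taken from the describe output in this report.
        Map<Integer, List<Integer>> isr = new LinkedHashMap<>();
        isr.put(1, Arrays.asList(3, 2));
        isr.put(4, Arrays.asList(6, 5));
        isr.put(18, Arrays.asList(2, 6));
        isr.put(21, Arrays.asList(5, 3));
        isr.put(31, Arrays.asList(3, 2));
        isr.put(34, Arrays.asList(6, 5));
        isr.put(48, Arrays.asList(2, 6));
        System.out.println(build("__consumer_offsets", isr));
    }
}
```

The printed JSON can be saved to a file and passed to kafka-reassign-partitions.sh via --reassignment-json-file together with --execute. Note that each ISR here holds only two brokers, so copying it verbatim leaves those partitions with two replicas instead of three.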

However, we are wondering what could have caused this corruption. The brokers have been running for the past 54 days and we have not performed any upgrade recently.


