Posted to jira@kafka.apache.org by "Nanda Kishore M S (Jira)" <ji...@apache.org> on 2021/07/07 13:27:00 UTC
[jira] [Created] (KAFKA-13044) __consumer_offsets corruption
Nanda Kishore M S created KAFKA-13044:
-----------------------------------------
Summary: __consumer_offsets corruption
Key: KAFKA-13044
URL: https://issues.apache.org/jira/browse/KAFKA-13044
Project: Kafka
Issue Type: Bug
Components: offset manager
Affects Versions: 2.5.0
Environment: Amazon Linux
Kafka Server: 2.5.0, scala version - 2.12
Reporter: Nanda Kishore M S
We hit an issue where clients were unable to discover a group coordinator. When we tried to read data from a topic via kafka-console-consumer, the client logs showed the following:
{{[2021-07-06 08:15:14,499] DEBUG [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Sending FindCoordinator request to broker kafka01-broker:9094 (id: 5 rack: us-west-2b) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2021-07-06 08:15:14,504] DEBUG [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Received FindCoordinator response ClientResponse(receivedTimeMs=1625559314504, latencyMs=5, disconnected=false, requestHeader=RequestHeader(apiKey=FIND_COORDINATOR, apiVersion=3, clientId=consumer-test-consumer-group-1-1, correlationId=32), responseBody=FindCoordinatorResponseData(throttleTimeMs=0, errorCode=0, errorMessage='NONE', nodeId=3, host='kafka02-broker', port=9094)) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2021-07-06 08:15:14,504] INFO [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Discovered group coordinator kafka02-broker:9094 (id: 2147483644 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2021-07-06 08:15:14,504] INFO [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Group coordinator kafka02-broker:9094 (id: 2147483644 rack: null) is unavailable or invalid, will attempt rediscovery (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2021-07-06 08:15:14,504] DEBUG [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Sending FindCoordinator request to broker kafka01-broker:9094 (id: 5 rack: us-west-2b) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2021-07-06 08:15:14,507] DEBUG [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Received FindCoordinator response ClientResponse(receivedTimeMs=1625559314504, latencyMs=5, disconnected=false, requestHeader=RequestHeader(apiKey=FIND_COORDINATOR, apiVersion=3, clientId=consumer-test-consumer-group-1-1, correlationId=32), responseBody=FindCoordinatorResponseData(throttleTimeMs=0, errorCode=0, errorMessage='NONE', nodeId=3, host='kafka02-broker', port=9094)) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2021-07-06 08:15:14,507] INFO [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Discovered group coordinator kafka02-broker:9094 (id: 2147483644 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2021-07-06 08:15:14,507] INFO [Consumer clientId=consumer-test-consumer-group-1-1, groupId=test-consumer-group-1] Group coordinator kafka02-broker:9094 (id: 2147483644 rack: null) is unavailable or invalid, will attempt rediscovery (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)}}
and so on.
We had a look at the __consumer_offsets topic, and the data looks wrong for 7 partitions (highlighted below), where the ISR set and the replica set are mutually exclusive.
./kafka-topics.sh --describe --zookeeper localhost:2181 --topic __consumer_offsets
Topic: __consumer_offsets PartitionCount: 50 ReplicationFactor: 3 Configs: compression.type=producer,cleanup.policy=compact,segment.bytes=104857600
Topic: __consumer_offsets Partition: 0 Leader: 5 Replicas: 5,3,4 Isr: 4,5,3
*Topic: __consumer_offsets Partition: 1 Leader: 3 Replicas: 6,4,5 Isr: 3,2*
Topic: __consumer_offsets Partition: 2 Leader: 1 Replicas: 1,5,6 Isr: 1,5,6
Topic: __consumer_offsets Partition: 3 Leader: 2 Replicas: 2,6,1 Isr: 1,2,6
*Topic: __consumer_offsets Partition: 4 Leader: 6 Replicas: 3,1,2 Isr: 6,5*
Topic: __consumer_offsets Partition: 5 Leader: 4 Replicas: 4,2,3 Isr: 4,2,3
Topic: __consumer_offsets Partition: 6 Leader: 5 Replicas: 5,6,1 Isr: 1,5,6
Topic: __consumer_offsets Partition: 7 Leader: 6 Replicas: 6,1,2 Isr: 1,2,6
Topic: __consumer_offsets Partition: 8 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3
Topic: __consumer_offsets Partition: 9 Leader: 2 Replicas: 2,3,4 Isr: 4,2,3
Topic: __consumer_offsets Partition: 10 Leader: 3 Replicas: 3,4,5 Isr: 4,5,3
Topic: __consumer_offsets Partition: 11 Leader: 4 Replicas: 4,5,6 Isr: 4,5,6
Topic: __consumer_offsets Partition: 12 Leader: 5 Replicas: 5,3,4 Isr: 4,5,3
Topic: __consumer_offsets Partition: 13 Leader: 6 Replicas: 6,4,5 Isr: 4,5,6
Topic: __consumer_offsets Partition: 14 Leader: 1 Replicas: 1,5,6 Isr: 1,5,6
Topic: __consumer_offsets Partition: 15 Leader: 2 Replicas: 2,6,1 Isr: 1,2,6
Topic: __consumer_offsets Partition: 16 Leader: 3 Replicas: 3,1,2 Isr: 1,2,3
Topic: __consumer_offsets Partition: 17 Leader: 4 Replicas: 4,2,3 Isr: 4,2,3
*Topic: __consumer_offsets Partition: 18 Leader: 2 Replicas: 5,1,3 Isr: 2,6*
Topic: __consumer_offsets Partition: 19 Leader: 6 Replicas: 6,2,4 Isr: 4,2,6
Topic: __consumer_offsets Partition: 20 Leader: 1 Replicas: 1,3,5 Isr: 1,5,3
*Topic: __consumer_offsets Partition: 21 Leader: 5 Replicas: 2,4,6 Isr: 5,3*
Topic: __consumer_offsets Partition: 22 Leader: 3 Replicas: 3,5,1 Isr: 1,5,3
Topic: __consumer_offsets Partition: 23 Leader: 4 Replicas: 4,6,2 Isr: 4,2,6
Topic: __consumer_offsets Partition: 24 Leader: 5 Replicas: 5,4,6 Isr: 4,5,6
Topic: __consumer_offsets Partition: 25 Leader: 6 Replicas: 6,5,1 Isr: 1,5,6
Topic: __consumer_offsets Partition: 26 Leader: 1 Replicas: 1,6,2 Isr: 1,2,6
Topic: __consumer_offsets Partition: 27 Leader: 2 Replicas: 2,1,3 Isr: 1,2,3
Topic: __consumer_offsets Partition: 28 Leader: 3 Replicas: 3,2,4 Isr: 4,2,3
Topic: __consumer_offsets Partition: 29 Leader: 4 Replicas: 4,3,5 Isr: 4,5,3
Topic: __consumer_offsets Partition: 30 Leader: 5 Replicas: 5,3,4 Isr: 4,5,3
*Topic: __consumer_offsets Partition: 31 Leader: 3 Replicas: 6,4,5 Isr: 3,2*
Topic: __consumer_offsets Partition: 32 Leader: 1 Replicas: 1,5,6 Isr: 1,5,6
Topic: __consumer_offsets Partition: 33 Leader: 2 Replicas: 2,6,1 Isr: 1,2,6
Topic: __consumer_offsets Partition: 34 Leader: 6 Replicas: 3,1,2 Isr: 6,5
Topic: __consumer_offsets Partition: 35 Leader: 4 Replicas: 4,2,3 Isr: 4,2,3
Topic: __consumer_offsets Partition: 36 Leader: 5 Replicas: 5,6,1 Isr: 1,5,6
Topic: __consumer_offsets Partition: 37 Leader: 6 Replicas: 6,1,2 Isr: 1,2,6
*Topic: __consumer_offsets Partition: 38 Leader: 1 Replicas: 1,2,3 Isr: 1,2,3*
Topic: __consumer_offsets Partition: 39 Leader: 2 Replicas: 2,3,4 Isr: 4,2,3
Topic: __consumer_offsets Partition: 40 Leader: 3 Replicas: 3,4,5 Isr: 4,5,3
Topic: __consumer_offsets Partition: 41 Leader: 4 Replicas: 4,5,6 Isr: 4,5,6
Topic: __consumer_offsets Partition: 42 Leader: 5 Replicas: 5,3,4 Isr: 4,5,3
Topic: __consumer_offsets Partition: 43 Leader: 6 Replicas: 6,4,5 Isr: 4,5,6
Topic: __consumer_offsets Partition: 44 Leader: 1 Replicas: 1,5,6 Isr: 1,5,6
Topic: __consumer_offsets Partition: 45 Leader: 2 Replicas: 2,6,1 Isr: 1,2,6
Topic: __consumer_offsets Partition: 46 Leader: 3 Replicas: 3,1,2 Isr: 1,2,3
Topic: __consumer_offsets Partition: 47 Leader: 4 Replicas: 4,2,3 Isr: 4,2,3
*Topic: __consumer_offsets Partition: 48 Leader: 2 Replicas: 5,1,3 Isr: 2,6*
Topic: __consumer_offsets Partition: 49 Leader: 6 Replicas: 6,2,4 Isr: 4,2,6
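As an aside (not part of the original report), the inconsistent partitions can be found mechanically by parsing the {{--describe}} output and flagging rows where the replica set and ISR set share no broker. A minimal sketch, assuming the standard one-partition-per-line output format shown above:

```python
import re

# Matches one partition line of `kafka-topics.sh --describe` output.
LINE_RE = re.compile(
    r"Partition:\s*(\d+)\s+Leader:\s*(-?\d+)\s+"
    r"Replicas:\s*([\d,]+)\s+Isr:\s*([\d,]+)"
)

def disjoint_partitions(describe_output: str):
    """Return [(partition, replicas, isr)] for rows where the replica set
    and the ISR set are mutually exclusive (no broker in common)."""
    bad = []
    for line in describe_output.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        partition = int(m.group(1))
        replicas = {int(b) for b in m.group(3).split(",")}
        isr = {int(b) for b in m.group(4).split(",")}
        if replicas.isdisjoint(isr):
            bad.append((partition, sorted(replicas), sorted(isr)))
    return bad

sample = """\
Topic: __consumer_offsets Partition: 0 Leader: 5 Replicas: 5,3,4 Isr: 4,5,3
Topic: __consumer_offsets Partition: 1 Leader: 3 Replicas: 6,4,5 Isr: 3,2
"""
print(disjoint_partitions(sample))  # partition 1: replicas {4,5,6} vs ISR {2,3}
```

Run against the full listing above, this flags the same rows the reporter highlighted (e.g. partitions 1, 4, 18, 21, 31, 48).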
Looking at the source code in {{AbstractCoordinator.java}}, {{client.isUnavailable(coordinator)}} seems to return true, hence the endless loop:
{code:java}
protected synchronized boolean ensureCoordinatorReady(final Timer timer) {
    ...
    } else if (coordinator != null && client.isUnavailable(coordinator)) {
        // we found the coordinator, but the connection has failed, so mark
        // it dead and backoff before retrying discovery
        markCoordinatorUnknown();
        timer.sleep(rebalanceConfig.retryBackoffMs);
    }
{code}
We found a workaround: re-assigning the highlighted partitions with kafka-reassign-partitions.sh, replacing the replica values with the ISR values.
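For illustration (assumptions noted below), the workaround can be expressed as the standard reassignment JSON that kafka-reassign-partitions.sh consumes, with each affected partition's replica list set to its current ISR. The partition-to-ISR mapping here is copied from the highlighted rows above; note this sketch would shrink those partitions to a replication factor of 2, so in practice one would likely pad each list back to 3 brokers afterwards:

```python
import json

# Partition -> current ISR, taken from the highlighted rows of the
# --describe output above (partitions whose replica set and ISR are disjoint).
isr_by_partition = {1: [3, 2], 4: [6, 5], 18: [2, 6],
                    21: [5, 3], 31: [3, 2], 48: [2, 6]}

# Standard input format for kafka-reassign-partitions.sh.
plan = {
    "version": 1,
    "partitions": [
        {"topic": "__consumer_offsets", "partition": p, "replicas": isr}
        for p, isr in sorted(isr_by_partition.items())
    ],
}
print(json.dumps(plan, indent=2))

# Saved to a file, the plan is applied with (Kafka 2.5):
#   kafka-reassign-partitions.sh --zookeeper localhost:2181 \
#       --reassignment-json-file reassign.json --execute
```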
However, we are wondering what would have caused this corruption. The brokers have been running for the past 54 days and we have not done any upgrade recently.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)