Posted to jira@kafka.apache.org by "John Roesler (Jira)" <ji...@apache.org> on 2020/08/31 20:21:00 UTC
[jira] [Commented] (KAFKA-10429) Group Coordinator unavailability leads to missing events

    [ https://issues.apache.org/jira/browse/KAFKA-10429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17187992#comment-17187992 ] 

John Roesler commented on KAFKA-10429:
--------------------------------------

Hi Navinder,

My first thought is that version 1.1.1 is extremely old, and a lot has actually changed in the consumers since then. Is there any chance you can try with a newer version of Streams and see if you still observe the issue?

Aside from that, from the logs you posted, it looks like it only took that instance a few seconds to re-acquire the connection to the coordinator, but the next paragraph implies that disconnections have lasted hours. Can you clarify?
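
For reference, the gap between the two quoted log lines can be checked directly from their timestamps. This is only a rough sketch: the class and method names are hypothetical (not part of Kafka), and it assumes both lines come from the same instance and use the timestamp layout shown in the description.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Kafka: measures the time between two log timestamps.
class CoordinatorGapSketch {
    // Timestamp layout used in the quoted log lines, e.g. 2020-08-25 03:24:59,214
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Milliseconds between the "unavailable" line and the "Discovered" line.
    static long gapMillis(String lostAt, String rediscoveredAt) {
        LocalDateTime start = LocalDateTime.parse(lostAt, LOG_TIME);
        LocalDateTime end = LocalDateTime.parse(rediscoveredAt, LOG_TIME);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        // Timestamps copied from the two log lines in the issue description.
        System.out.println(
                gapMillis("2020-08-25 03:24:59,214", "2020-08-25 03:25:02,218") + " ms");
    }
}
```

For the two lines in the description this comes to about three seconds, which is why the "more than a few hours" statement needs clarifying.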

A few other notes:
 * Disconnecting from the coordinator shouldn't interrupt processing, since you can still fetch from the leader and followers of the topic partitions you're assigned
 * If an instance is disconnected for longer than the session timeout (session.timeout.ms), you would actually see rebalances caused by that instance having dropped out of the group
 * If the log cleaner removes records at or beyond the consumer's current position, the consumer would throw an InvalidOffsetException (unless there's an auto-reset policy configured), so you wouldn't silently miss data
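
To make the last two notes concrete, here is a minimal sketch of the consumer settings involved. The keys (session.timeout.ms, heartbeat.interval.ms, auto.offset.reset) are standard consumer configs; the class name, method name and the 10-second value are only illustrative assumptions, and Kafka Streams may apply its own default for auto.offset.reset, so check the effective configuration for your version.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not part of Kafka or of the reporter's application.
class ConsumerSettingsSketch {
    static Properties build(int sessionTimeoutMs) {
        Properties props = new Properties();
        // A member that sends no heartbeat for this long is removed from the group,
        // which triggers a rebalance.
        props.setProperty("session.timeout.ms", Integer.toString(sessionTimeoutMs));
        // Commonly kept at no more than one third of the session timeout.
        props.setProperty("heartbeat.interval.ms", Integer.toString(sessionTimeoutMs / 3));
        // "none" makes an out-of-range position surface as an exception
        // instead of resetting to the earliest or latest offset.
        props.setProperty("auto.offset.reset", "none");
        return props;
    }

    public static void main(String[] args) {
        // 10000 ms is an assumed example value, not a recommendation.
        build(10000).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

With "earliest" or "latest" in place of "none", the consumer would jump to a valid offset on its own, so a gap would be visible only in the logs rather than as an exception.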

> Group Coordinator unavailability leads to missing events
> --------------------------------------------------------
>
>                 Key: KAFKA-10429
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10429
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 1.1.1
>            Reporter: Navinder Brar
>            Priority: Major
>
> We are regularly getting this Exception in logs.
> [2020-08-25 03:24:59,214] INFO [Consumer clientId=appId-StreamThread-1-consumer, groupId=dashavatara] Group coordinator ip:9092 (id: 1452096777 rack: null) is *unavailable* or invalid, will attempt rediscovery (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
>  
> And after some time it becomes discoverable:
> [2020-08-25 03:25:02,218] INFO [Consumer clientId=appId-c3d1d186-e487-4993-ae3d-5fed75887e6b-StreamThread-1-consumer, groupId=appId] Discovered group coordinator ip:9092 (id: 1452096777 rack: null) (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
>  
> Now, the doubt I have is why this unavailability doesn't trigger a rebalance in the cluster. We have a few hours of retention on the source Kafka topics, and sometimes this unavailability lasts for more than a few hours. Since it neither triggers a rebalance nor stops processing on the other nodes (which are connected to the GC), we never learn that an issue has occurred, and in the meantime we lose events from our source topics.
>  
> There are some resolutions mentioned on stackoverflow but those configs are already set in our kafka:
> default.replication.factor=3
> offsets.topic.replication.factor=3
>  
> It would be great to understand why this issue is happening, why it doesn't trigger a rebalance, and whether there is any known solution for it.


