Posted to dev@kafka.apache.org by "Ben Stopford (JIRA)" <ji...@apache.org> on 2015/12/01 18:39:11 UTC

[jira] [Commented] (KAFKA-2909) Example of Gap in Consumption after Restart

    [ https://issues.apache.org/jira/browse/KAFKA-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034158#comment-15034158 ] 

Ben Stopford commented on KAFKA-2909:
-------------------------------------

This problem is producer-side: the acked values never made it into the Kafka data files.

> Example of Gap in Consumption after Restart
> -------------------------------------------
>
>                 Key: KAFKA-2909
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2909
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: consumer
>            Reporter: Ben Stopford
>            Assignee: Jason Gustafson
>
> This seems very similar to Rajini's reported KAFKA-2891.
> *Context*
> The context is the Security Rolling Upgrade test with a 30s consumer timeout. There was a 2s sleep between restarts. Throughput was limited to 1000 messages per second. 
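> For illustration only, a minimal sketch of that bounce loop (the node list and restart_node are hypothetical stand-ins for the harness, not the actual system-test code):
> {code}
> import time
>
> # Rough sketch of the rolling bounce in this run: each broker is restarted in
> # turn with a 2s pause, while the producer stays capped at 1000 msgs/sec and
> # the consumer runs with a 30s timeout. restart_node() is a hypothetical
> # stand-in for the harness call that stops, reconfigures, and restarts a broker.
> def rolling_bounce(nodes, restart_node, pause_s=2):
>     for node in nodes:
>         restart_node(node)
>         time.sleep(pause_s)  # the 2s sleep between restarts from this run
> {code}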
> *Failure*
> At least one acked message did not appear in the consumed messages. acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 64403, 64406, 64409, 36799)
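> For reference, that set is just the difference between the producer-acked values and the consumer-received values; a minimal sketch of the check, with acked and consumed standing in as hypothetical lists of payloads from the harness:
> {code}
> def missing_values(acked, consumed):
>     """Values the producer saw acked but the consumer never received."""
>     return set(acked) - set(consumed)
>
> # For this run the result was:
> # {36799, 36802, 36804, 36805, 36807, 36808, 36810, 36811, 64403, 64406, 64409}
> {code}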
> Checking whether the missing data was written to the Kafka data files:
> {quote}
> value 36802 -> partition 1,offset: 12216
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files worker7/kafka-data-logs/test_topic-1/00000000000000000000.log | grep 'offset: 12216'
> ->offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 compresscodec: NoCompressionCodec crc: 3001177408 payload: 47482
> So in this instance the data is not there: offset 12216 holds payload 47482, not 36802. 
> {quote}
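> The same DumpLogSegments check can be scripted for every missing value; a rough sketch (the log path and the value -> (partition, offset) mapping come from the harness output and are assumptions here):
> {code}
> import subprocess
>
> DUMP = ["kafka/bin/kafka-run-class.sh", "kafka.tools.DumpLogSegments", "--files"]
>
> def payload_at_offset(log_file, offset):
>     """Return the payload DumpLogSegments reports at `offset`, or None if absent."""
>     out = subprocess.run(DUMP + [log_file], capture_output=True, text=True).stdout
>     for line in out.splitlines():
>         if "offset: %d " % offset in line and "payload:" in line:
>             return line.rsplit("payload:", 1)[1].strip()
>     return None
>
> # e.g. for value 36802, acked at partition 1, offset 12216:
> # payload_at_offset("worker7/kafka-data-logs/test_topic-1/00000000000000000000.log", 12216)
> # returns '47482', confirming 36802 never reached the log.
> {code}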
> The first missing value was written at 20:42:30,185, which is around the time the third node went down. 
> The failed writes correlate with the consumer logging NOT_COORDINATOR_FOR_GROUP and "Marking the coordinator" messages. There are many of these messages over a long period, though, so it's hard to infer this as the cause or to correlate them specifically with the error. 
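> One way to firm that up is to line the two logs up by timestamp; a rough sketch, assuming the coordinator-error timestamps have already been grepped out of the consumer log and using the time-only stamps quoted in this ticket:
> {code}
> from datetime import datetime
>
> FMT = "%H:%M:%S,%f"  # matches the time-only stamps quoted above; real log lines also carry the date
>
> def near_coordinator_error(write_ts, coordinator_ts_list, window_s=5):
>     """True if a NOT_COORDINATOR_FOR_GROUP / 'Marking the coordinator' line
>     was logged within window_s seconds of the failed write."""
>     w = datetime.strptime(write_ts, FMT)
>     return any(abs((datetime.strptime(t, FMT) - w).total_seconds()) <= window_s
>                for t in coordinator_ts_list)
> {code}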
> *Timeline*
> {quote}
> grep -r 'shutdown complete' *
> 20:42:06,132 - Node 1 shutdown completed 
> 20:42:18,560 - Node 2 shutdown completed 
> 20:42:30,185 - *The producer writes the values that never make it*
> 20:42:31,164 - Node 3 shutdown completed 
> 20:42:57,872 - Node 1 shutdown completed 
> …
> {quote}
> Logging was verbose during this run so log files can be found here: [click me|https://www.dropbox.com/s/owwzh37cs304qh4/2015-11-26--001%20%283%29.tar.gz?dl=0]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)