Posted to jira@kafka.apache.org by "F Méthot (Jira)" <ji...@apache.org> on 2021/09/03 17:18:00 UTC

[jira] [Updated] (KAFKA-13272) KStream offset stuck with exactly_once enabled after brokers outage

     [ https://issues.apache.org/jira/browse/KAFKA-13272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

F Méthot updated KAFKA-13272:
-----------------------------
    Summary: KStream offset stuck with exactly_once enabled after brokers outage  (was: KStream offset stuck with exactly_once after brokers outage)

> KStream offset stuck with exactly_once enabled after brokers outage
> -------------------------------------------------------------------
>
>                 Key: KAFKA-13272
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13272
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.8.0
>         Environment: Kafka running on Kubernetes
> centos
>            Reporter: F Méthot
>            Priority: Major
>
> Our KStream app's offsets stay stuck with exactly_once after an outage.
> Running with KStream 2.8, Kafka broker 2.8, 3 brokers.
> commands topic has 10 partitions (replication 2, min-insync 2)
> command-expiry-store-changelog topic has 10 partitions (replication 2, min-insync 2)
> events topic has 10 partitions (replication 2, min-insync 2)
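> For reference, a minimal sketch of an equivalent topic setup using the Java AdminClient (the bootstrap address is a placeholder; the changelog topic is created by Streams itself, so only the two user topics are shown):
> {code:java}
> import java.util.List;
> import java.util.Map;
> import java.util.Properties;
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> import org.apache.kafka.clients.admin.NewTopic;
> 
> public class CreateTopicsSketch {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder address
>         try (Admin admin = Admin.create(props)) {
>             // 10 partitions, replication factor 2, min.insync.replicas 2 for each user topic
>             Map<String, String> configs = Map.of("min.insync.replicas", "2");
>             admin.createTopics(List.of(
>                     new NewTopic("commands", 10, (short) 2).configs(configs),
>                     new NewTopic("events", 10, (short) 2).configs(configs))).all().get();
>         }
>     }
> }
> {code}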
> The app runs with this topology:
> Topologies:
>  
> {code:java}
> Sub-topology: 0
>  Source: KSTREAM-SOURCE-0000000000 (topics: [commands])
>  --> KSTREAM-TRANSFORM-0000000001
>  Processor: KSTREAM-TRANSFORM-0000000001 (stores: [])
>  --> KSTREAM-TRANSFORM-0000000002
>  <-- KSTREAM-SOURCE-0000000000
>  Processor: KSTREAM-TRANSFORM-0000000002 (stores: [command-expiry-store])
>  --> KSTREAM-SINK-0000000003
>  <-- KSTREAM-TRANSFORM-0000000001
>  Sink: KSTREAM-SINK-0000000003 (topic: events)
>  <-- KSTREAM-TRANSFORM-0000000002
> {code}
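> For context, a minimal sketch of a builder that produces roughly the topology above; the transformer classes here are hypothetical stand-ins, not our actual code:
> {code:java}
> import org.apache.kafka.common.serialization.Serdes;
> import org.apache.kafka.streams.KeyValue;
> import org.apache.kafka.streams.StreamsBuilder;
> import org.apache.kafka.streams.Topology;
> import org.apache.kafka.streams.kstream.Consumed;
> import org.apache.kafka.streams.kstream.Produced;
> import org.apache.kafka.streams.kstream.Transformer;
> import org.apache.kafka.streams.processor.ProcessorContext;
> import org.apache.kafka.streams.state.KeyValueStore;
> import org.apache.kafka.streams.state.Stores;
> 
> public class CommandsTopologySketch {
> 
>     // Hypothetical stand-in for the stateless transformer (KSTREAM-TRANSFORM-0000000001).
>     static class ParseCommandTransformer implements Transformer<String, String, KeyValue<String, String>> {
>         public void init(ProcessorContext context) { }
>         public KeyValue<String, String> transform(String key, String value) { return KeyValue.pair(key, value); }
>         public void close() { }
>     }
> 
>     // Hypothetical stand-in for the expiry transformer backed by command-expiry-store (KSTREAM-TRANSFORM-0000000002).
>     static class ExpiryTransformer implements Transformer<String, String, KeyValue<String, String>> {
>         private KeyValueStore<String, String> store;
>         @SuppressWarnings("unchecked")
>         public void init(ProcessorContext context) {
>             store = (KeyValueStore<String, String>) context.getStateStore("command-expiry-store");
>         }
>         public KeyValue<String, String> transform(String key, String value) {
>             store.put(key, value); // writes are backed by the command-expiry-store-changelog topic
>             return KeyValue.pair(key, value);
>         }
>         public void close() { }
>     }
> 
>     public static Topology build() {
>         StreamsBuilder builder = new StreamsBuilder();
>         builder.addStateStore(Stores.keyValueStoreBuilder(
>                 Stores.persistentKeyValueStore("command-expiry-store"),
>                 Serdes.String(), Serdes.String()));
>         builder.stream("commands", Consumed.with(Serdes.String(), Serdes.String()))
>                .transform(ParseCommandTransformer::new)
>                .transform(ExpiryTransformer::new, "command-expiry-store")
>                .to("events", Produced.with(Serdes.String(), Serdes.String()));
>         return builder.build();
>     }
> }
> {code}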
> h3. Attempt 1 at reproducing this issue
>  
> Our stream app runs with processing.guarantee *exactly_once* 
> After a Kafka test outage where all 3 broker pods were deleted at the same time, the brokers restarted and initialized successfully.
> When restarting the topology above, one of the tasks would never initialize fully; the restore phase kept outputting this message every few minutes:
>  
> {code:java}
> 2021-08-16 14:20:33,421 INFO stream-thread [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] Restoration in progress for 1 partitions. {commands-processor-expiry-store-changelog-8: position=11775908, end=11775911, totalRestored=2002076} [commands-processor-51b0a534-34b6-47a4-bd9c-c57b6ecb8665-StreamThread-1] (org.apache.kafka.streams.processor.internals.StoreChangelogReader)
> {code}
> The task for partition 8 would never initialize, and no more data would be read from the source commands topic for that partition.
>  
> In an attempt to recover, we restarted the stream app with processing.guarantee set back to at_least_once; it then proceeded with reading the changelog and restoring partition 8 fully.
> But we noticed afterward, for the next hour until we rebuilt the system, that partition 8 of command-expiry-store-changelog was not being cleaned/compacted by the log cleaner, unlike the other partitions. (This could be unrelated, because we have seen that before.)
> So we resorted to deleting/recreating our command-expiry-store-changelog and events topics and regenerating them from the commands topic, reading from the beginning.
> Things went back to normal.
> h3. Attempt 2 at reproducing this issue
> We force-deleted all 3 pods running Kafka.
> After that, one of the partitions could not be restored (as reported in the previous attempt).
> For that partition, we noticed these logs on the broker:
> {code:java}
> [2021-08-27 17:45:32,799] INFO [Transaction Marker Channel Manager 1002]: Couldn’t find leader endpoint for partitions Set(__consumer_offsets-11, command-expiry-store-changelog-9) while trying to send transaction markers for commands-processor-0_9, these partitions are likely deleted already and hence can be skipped (kafka.coordinator.transaction.TransactionMarkerChannelManager){code}
> Then:
> - we stopped the kstream app,
> - restarted the Kafka brokers cleanly,
> - restarted the kstream app.
> These log messages showed up in the kstream app log:
>  
> {code:java}
> 2021-08-27 18:34:42,413 INFO [Consumer clientId=commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1-consumer, groupId=commands-processor] The following partitions still have unstable offsets which are not cleared on the broker side: [commands-9], this could be either transactional offsets waiting for completion, or normal offsets waiting for replication after appending to local log [commands-processor-76602c87-f682-4648-859b-8fa9b6b937f3-StreamThread-1] (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
>  
> {code}
> This would cause our processor not to consume from that specific source topic-partition.
> Deleting downstream topics and replaying data would NOT fix the issue (with either EXACTLY_ONCE or AT_LEAST_ONCE).
> Workaround found:
> We deleted the consumer group associated with the processor and restarted the kstream application; the application went on to process data normally. (We have resigned ourselves to using AT_LEAST_ONCE for now.)
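> For reference, a minimal sketch of that group deletion via the Java AdminClient (the same can be done with the kafka-consumer-groups.sh --delete tooling); the bootstrap address is a placeholder:
> {code:java}
> import java.util.List;
> import java.util.Properties;
> import org.apache.kafka.clients.admin.Admin;
> import org.apache.kafka.clients.admin.AdminClientConfig;
> 
> public class DeleteStreamsGroupSketch {
>     public static void main(String[] args) throws Exception {
>         Properties props = new Properties();
>         props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder address
>         try (Admin admin = Admin.create(props)) {
>             // The Streams application.id ("commands-processor", per the logs) is also the consumer group id.
>             // The group must be empty, i.e. the stream app must be stopped, before it can be deleted.
>             admin.deleteConsumerGroups(List.of("commands-processor")).all().get();
>         }
>     }
> }
> {code}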
> KStream config:
> StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000
> StreamsConfig.REPLICATION_FACTOR_CONFIG, 2
> StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000
> StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24MB
> ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"
> StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE (now AT_LEAST_ONCE)
> producer.delivery.timeout.ms=120000
> consumer.session.timeout.ms=30000
> consumer.heartbeat.interval.ms=10000
> consumer.max.poll.interval.ms=300000
> num.stream.threads=1
>  
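> Expressed as a Streams Properties object, the configuration above corresponds roughly to the following sketch (application id taken from the logs; the bootstrap address is a placeholder):
> {code:java}
> import java.util.Properties;
> import org.apache.kafka.clients.consumer.ConsumerConfig;
> import org.apache.kafka.clients.producer.ProducerConfig;
> import org.apache.kafka.streams.StreamsConfig;
> 
> public class CommandsProcessorConfigSketch {
>     static Properties streamsConfig() {
>         Properties props = new Properties();
>         props.put(StreamsConfig.APPLICATION_ID_CONFIG, "commands-processor");  // from the logs
>         props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");       // placeholder address
>         props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE); // now AT_LEAST_ONCE
>         props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 2);
>         props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000);
>         props.put(StreamsConfig.RETRY_BACKOFF_MS_CONFIG, 2000);
>         props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 24 * 1024 * 1024); // 24 MB
>         props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 1);
>         props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
>         // client overrides, passed through with the producer./consumer. prefixes
>         props.put(StreamsConfig.producerPrefix(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG), 120000);
>         props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 30000);
>         props.put(StreamsConfig.consumerPrefix(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG), 10000);
>         props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 300000);
>         return props;
>     }
> }
> {code}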
> We will be doing more tests and I will update the ticket with any findings.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)