Posted to dev@kafka.apache.org by "Abhijit Patil (Jira)" <ji...@apache.org> on 2022/10/19 10:39:00 UTC

[jira] [Created] (KAFKA-14322) Kafka node eating Disk continuously

Abhijit Patil created KAFKA-14322:
-------------------------------------

             Summary: Kafka node eating Disk continuously 
                 Key: KAFKA-14322
                 URL: https://issues.apache.org/jira/browse/KAFKA-14322
             Project: Kafka
          Issue Type: Bug
          Components: log, log cleaner
    Affects Versions: 2.8.1
            Reporter: Abhijit Patil
         Attachments: image-2022-10-19-15-51-52-735.png, image-2022-10-19-15-53-39-928.png

We have a 2.8.1 Kafka cluster in our production environment. One node shows continuously growing disk consumption, eventually eating all of the allocated disk space and crashing with no disk space left.

!image-2022-10-19-15-51-52-735.png|width=344,height=194!

 

!image-2022-10-19-15-53-39-928.png|width=470,height=146!

[Log partition=__consumer_offsets-41, dir=/var/lib/kafka/data/kafka-log0] Rolled new log segment at offset 10537467423 in 4 ms. (kafka.log.Log) [data-plane-kafka-request-handler-4]


I can see that on node 0, partition __consumer_offsets-41 keeps rolling new segments, but they never get cleaned up.
This is the root cause of the disk usage increase.
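
Since __consumer_offsets is a compacted topic, its old segments are only removed by the log cleaner, so a stuck or dead cleaner thread would explain segments piling up forever. A rough way to check this on the broker is sketched below (the data directory comes from the log line above; the application-log path is an assumption):

# The cleaner checkpoint records how far compaction has progressed per partition;
# an entry for __consumer_offsets-41 that never advances (or is missing) means it is not being cleaned.
cat /var/lib/kafka/data/kafka-log0/cleaner-offset-checkpoint

# With the default log4j setup the log cleaner writes to log-cleaner.log;
# a cleaner thread that died shows up there as an uncaught exception / error.
grep -iE "uncaught exception|died|error" /opt/kafka/logs/log-cleaner.log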



Due to some condition/bug/trigger, something has gone wrong internally with the consumer offset coordinator thread and it has gone berserk!
 
Take a look at the consumer-offset logs it is generating below. On closer inspection, it is the same data being written in a loop forever. The product topic in question has no traffic. This is generating an insane amount of consumer-offset logs, currently amounting to *571GB*, and it is endless: no matter how many terabytes we add, it will eventually eat them all.

One more thing: the consumer-offset logs it is generating also mark everything as invalid, as you can see in the second log dump below.
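
To confirm that the coordinator really is rewriting the same commits even though the topic has no traffic, the committed offsets can be watched over time; a sketch, assuming the group name seen in the dumps below and a hypothetical broker address:

# If CURRENT-OFFSET never moves for the group while __consumer_offsets-41 keeps growing,
# the coordinator is writing identical offset commits in a loop.
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group response-consumer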
 
-kafka-0 data]$ du -sh kafka-log0/__consumer_offsets-*
12K kafka-log0/__consumer_offsets-11
12K kafka-log0/__consumer_offsets-14
12K kafka-log0/__consumer_offsets-17
12K kafka-log0/__consumer_offsets-2
12K kafka-log0/__consumer_offsets-20
12K kafka-log0/__consumer_offsets-23
12K kafka-log0/__consumer_offsets-26
12K kafka-log0/__consumer_offsets-29
12K kafka-log0/__consumer_offsets-32
12K kafka-log0/__consumer_offsets-35
12K kafka-log0/__consumer_offsets-38
*588G* kafka-log0/__consumer_offsets-41
48K kafka-log0/__consumer_offsets-44
12K kafka-log0/__consumer_offsets-47
12K kafka-log0/__consumer_offsets-5
12K kafka-log0/__consumer_offsets-8
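
The offset records and batch metadata below look like kafka-dump-log.sh output; a sketch of how comparable dumps can be produced from one of the segments in __consumer_offsets-41 (<segment> is a placeholder for an actual segment file name):

# Decode the __consumer_offsets payloads into group/topic/partition -> OffsetAndMetadata records.
bin/kafka-dump-log.sh --offsets-decoder \
  --files /var/lib/kafka/data/kafka-log0/__consumer_offsets-41/<segment>.log

# Batch-level view (baseOffset, CreateTime, isvalid, ...) of the same segment.
bin/kafka-dump-log.sh \
  --files /var/lib/kafka/data/kafka-log0/__consumer_offsets-41/<segment>.log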
 
[response-consumer,feature.response.topic,2]::OffsetAndMetadata(offset=107, leaderEpoch=Optional[23], metadata=, commitTimestamp=1664883985122, expireTimestamp=None) *[response-consumer,feature.response.topic,15]::OffsetAndMetadata(offset=112, leaderEpoch=Optional[25], metadata=, commitTimestamp=1664883985129, expireTimestamp=None)*
 
 
[response-consumer,feature.response.topic,15]::OffsetAndMetadata(offset=112, leaderEpoch=Optional[25], metadata=, commitTimestamp=1664883985139, expireTimestamp=None) [feature-telemetry-response-consumer,.feature.response.topic,13]::OffsetAndMetadata(offset=112, leaderEpoch=Optional[24], metadata=, commitTimestamp=1664883985139, expireTimestamp=None)
 
 
baseOffset: 5616487061 lastOffset: 5616487061 count: 1 baseSequence: 0 lastSequence: 0 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 6 isTransactional: false
isControl: false position: 3423 CreateTime: 1660892213452 size: 175 magic: 2 compresscodec: NONE crc: 1402370404 *isvalid: true*

baseOffset: 5616487062 lastOffset: 5616487062 count: 1 baseSequence: 0 lastSequence: 0 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 6 isTransactional: false
isControl: false position: 3598 CreateTime: 1660892213462 size: 175 magic: 2 compresscodec: NONE crc: 1105941790 *isvalid: true*
| offset: 5616487062 CreateTime: 1660892213462 keysize: 81 valuesize: 24 sequence: 0 headerKeys: [] key:

For our topics we have the retention configuration below:

retention.ms: 86400000
segment.bytes: 1073741824
 
For the consumer offsets internal topic, the cleanup policy and retention are the defaults.
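
The effective cleanup policy and retention on the internal topic can be double-checked directly; a sketch, assuming a hypothetical broker address and config path (cleanup.policy should be compact here, and the broker-side offsets.retention.minutes controls when committed offsets expire):

# Show any per-topic overrides on __consumer_offsets (the broker defaults apply if this is empty).
bin/kafka-configs.sh --bootstrap-server localhost:9092 --describe \
  --entity-type topics --entity-name __consumer_offsets

# Broker-side settings that govern offset expiry and whether compaction runs at all.
grep -E "offsets.retention.minutes|log.cleaner.enable" /opt/kafka/config/server.properties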
 
We suspect this is similar to https://issues.apache.org/jira/browse/KAFKA-9543 

This appears in one environment only; the same cluster with the same configuration works correctly in other environments.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)