Posted to users@kafka.apache.org by Jacek Szewczyk <ja...@gmail.com> on 2020/04/08 08:46:27 UTC

Kafka CPU spikes every 5 minutes

Hi All,
 
I am seeing strange behavior with Kafka 2.0.0.3.1.4. My cluster contains 9 brokers plus 3 dedicated ZooKeepers, and for an unknown reason the CPU spikes every 5 minutes, which causes timeouts between producers, consumers, and brokers. Basically, every 5 minutes the CPU jumps to 90+% while network utilization drops to almost zero (it should be in the 100 MBps range).
Each broker has 64G of memory, the heap is set to 9G, and there are 8 cores and a 4G uplink. I have 500 partitions (replication=2) and around 1000 producers sending data at 1-minute intervals. The aggregate input rate is around 600k/s.
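To correlate the spikes with client activity, a rough sketch along these lines (assuming psutil is installed on a broker host; 9092 matches the listener in the config below) can sample CPU next to established connections once a minute:

# Sketch only: sample broker CPU and established connections on the Kafka
# listener once a minute, to see whether the 5-minute CPU spikes line up
# with a jump in client connections. Assumes psutil is installed; 9092 is
# the listener port from the config below.
import time
import psutil

LISTENER_PORT = 9092

while True:
    cpu = psutil.cpu_percent(interval=1)
    conns = sum(
        1
        for c in psutil.net_connections(kind="tcp")
        if c.status == psutil.CONN_ESTABLISHED
        and c.laddr and c.laddr.port == LISTENER_PORT
    )
    print(time.strftime("%H:%M:%S"), f"cpu={cpu:.0f}%", f"connections={conns}")
    time.sleep(60)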
Here is my config:
 
auto.create.topics.enable=false
auto.leader.rebalance.enable=true
compression.type=producer
controlled.shutdown.enable=true
controlled.shutdown.max.retries=3
controlled.shutdown.retry.backoff.ms=5000
controller.message.queue.size=10
controller.socket.timeout.ms=30000
default.replication.factor=2
delete.topic.enable=true
leader.imbalance.check.interval.seconds=300
leader.imbalance.per.broker.percentage=10
listeners=PLAINTEXT://localhost:9092
log.cleanup.interval.mins=10
log.dirs=/diskc/kafka-logs,/diskd/kafka-logs,/diske/kafka-logs,/diskf/kafka-logs,/diskg/kafka-logs,/diskh/kafka-logs,/diskj/kafka-logs,/diskk/kafka-logs
log.index.interval.bytes=4096
log.index.size.max.bytes=10485760
log.retention.bytes=-1
log.retention.check.interval.ms=600000
log.retention.hours=24
log.roll.hours=24
log.segment.bytes=1073741824
message.max.bytes=1000000
min.insync.replicas=1
num.io.threads=8
num.network.threads=3000
num.partitions=100
num.recovery.threads.per.data.dir=4
num.replica.fetchers=4
offset.metadata.max.bytes=4096
offsets.commit.required.acks=-1
offsets.commit.timeout.ms=5000
offsets.load.buffer.size=5242880
offsets.retention.check.interval.ms=600000
offsets.retention.minutes=86400000
offsets.topic.compression.codec=0
offsets.topic.num.partitions=50
offsets.topic.replication.factor=3
offsets.topic.segment.bytes=104857600
producer.metrics.enable=false
producer.purgatory.purge.interval.requests=10000
queued.max.requests=500
replica.fetch.max.bytes=1048576
replica.fetch.min.bytes=1
replica.fetch.wait.max.ms=500
replica.high.watermark.checkpoint.interval.ms=5000
replica.lag.max.messages=4000
replica.lag.time.max.ms=10000
replica.socket.receive.buffer.bytes=65536
replica.socket.timeout.ms=30000
sasl.enabled.mechanisms=GSSAPI
sasl.mechanism.inter.broker.protocol=GSSAPI
security.inter.broker.protocol=PLAINTEXT
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
zookeeper.connection.timeout.ms=25000
zookeeper.session.timeout.ms=30000
zookeeper.sync.time.ms=2000
 
In the logs, every spike starts with ISR shrinking and continues with timeouts like this:
INFO [Partition partition-220 broker=1010] Shrinking ISR from 1010,1006 to 1010 (kafka.cluster.Partition)
And a ton of messages like:
WARN Attempting to send response via channel for which there is no open connection, connection IP:9092-IP:45520-1 (kafka.network.Processor)
WARN [ReplicaFetcher replicaId=1010, leaderId=1006, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1010, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={topic-312=(offset=984223132, logStartOffset=916099079, maxBytes=1048576)}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1467131318, epoch=5031)) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 1006 was disconnected before the response was read
        at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
        at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:96)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:240)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:43)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:149)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:114)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
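To confirm the ISR shrink from outside the broker while a spike is happening, a small sketch like this (assuming kafka-topics.sh is on the PATH and using the ZooKeeper quorum from zookeeper.connect above) simply polls for under-replicated partitions:

# Sketch only: poll for under-replicated partitions once a minute during a
# spike. Assumes kafka-topics.sh is on the PATH and the ZooKeeper quorum
# from zookeeper.connect in the config above.
import subprocess
import time

ZK = "zk1:2181,zk2:2181,zk3:2181"

while True:
    out = subprocess.run(
        ["kafka-topics.sh", "--zookeeper", ZK,
         "--describe", "--under-replicated-partitions"],
        capture_output=True, text=True,
    ).stdout
    print(time.strftime("%H:%M:%S"),
          len(out.splitlines()), "under-replicated partitions")
    time.sleep(60)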
 
 
I’ve also tried topics with no replication, even for __consumer_offsets, but the result was the same. The only setting that makes a difference is a lower number of partitions: going from 500 to 200 makes things more stable, but the 5-minute spike still exists.
I’ve played around with multiple settings and the issue persists no matter what.
 
I would be grateful if anyone can comment on the CPU spikes and shed some light on how to fix/improve this.
 
Thanks,
Jacek

Re: Kafka CPU spikes every 5 minutes

Posted by Jacek Szewczyk <ja...@gmail.com>.
Never mind, I found the answer. I had an unexpected cron job firing every 5 minutes and blasting the cluster with connections from 2k+ additional servers.
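In case anyone hits the same thing: the giveaway is the number of distinct client IPs connecting every 5 minutes. A quick sketch like this (assuming psutil is installed on a broker host) makes the periodic connection storm obvious:

# Sketch only: count distinct remote IPs with established connections to
# the broker listener (port 9092). A periodic jump in this number points
# at an external connection storm rather than a broker-side problem.
# Assumes psutil is installed on the broker host.
import time
import psutil

LISTENER_PORT = 9092

while True:
    clients = {
        c.raddr.ip
        for c in psutil.net_connections(kind="tcp")
        if c.status == psutil.CONN_ESTABLISHED
        and c.laddr and c.laddr.port == LISTENER_PORT
        and c.raddr
    }
    print(time.strftime("%H:%M:%S"), "distinct client IPs:", len(clients))
    time.sleep(30)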


> On Apr 8, 2020, at 10:46, Jacek Szewczyk <ja...@gmail.com> wrote: