You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@kafka.apache.org by Yiming Liu <li...@gmail.com> on 2015/09/16 11:44:00 UTC

Why my brokers delete the whole partition data(150G) in one second, and recover later by replication?

Hello, Kafka Expert:

I met some strange problem recently. Some topic may delete the whole
partition data in one second, and recover later by replication. The the
example of topic "ubt.mobilemonitor". It has 30 partitions, with
replication 2. The retention.bytes is 157374182400(150G). The default
segment is *1073741824(1G). *

*I want to know the reason, why the broker was scheduled to remove all
segments files. It happened two or three times a week. The deleted
partitions were random. No regular pattern found.*

My Kafka Cluster has 16 broker nodes(8Core24G3T/per node), with 5 zookeeper
nodes cluster. Have already created 18 topics, with the following broker
setting:

server.properties
*############################# Server Basics #############################*
*broker.id <http://broker.id>=10811396*

*port=9092*
*host.name <http://host.name>=10.8.113.96*

*log.dirs=/data01/kafka,/data02/kafka,/data03/kafka,/data04/kafka,/data05/kafka,/data06/kafka*

*# Replication configurations*
*num.replica.fetchers=4*
*replica.fetch.max.bytes=1048576*
*replica.fetch.wait.max.ms <http://replica.fetch.wait.max.ms>=500*
*replica.high.watermark.checkpoint.interval.ms
<http://replica.high.watermark.checkpoint.interval.ms>=5000*
*replica.socket.timeout.ms <http://replica.socket.timeout.ms>=30000*
*replica.socket.receive.buffer.bytes=65536*
*replica.lag.time.max.ms <http://replica.lag.time.max.ms>=10000*
*replica.lag.max.messages=4000*

*controller.socket.timeout.ms <http://controller.socket.timeout.ms>=30000*
*controller.message.queue.size=10*

*# Log configuration*
*num.partitions=8*
*message.max.bytes=1000000*
*auto.create.topics.enable=false*
*delete.topic.enable=true*
*log.index.interval.bytes=4096*
*log.index.size.max.bytes=10485760*
*log.retention.hours=168*
*log.retention.bytes=10737418240*
*log.flush.interval.ms <http://log.flush.interval.ms>=10000*
*log.flush.interval.messages=20000*
*log.flush.scheduler.interval.ms
<http://log.flush.scheduler.interval.ms>=2000*
*log.roll.hours=168*
*log.retention.check.interval.ms
<http://log.retention.check.interval.ms>=300000*
*log.segment.bytes=1073741824*

*# ZK configuration*
*zookeeper.connect=10.8.113.113:2181
<http://10.8.113.113:2181>,10.8.113.114:2181
<http://10.8.113.114:2181>,10.8.113.115:2181
<http://10.8.113.115:2181>,10.8.113.116:2181
<http://10.8.113.116:2181>,10.8.113.117:2181 <http://10.8.113.117:2181>*
*zookeeper.connection.timeout.ms
<http://zookeeper.connection.timeout.ms>=6000*
*zookeeper.sync.time.ms <http://zookeeper.sync.time.ms>=2000*

*# Socket server configuration*
*num.io.threads=8*
*num.network.threads=8*
*socket.request.max.bytes=104857600*
*socket.receive.buffer.bytes=1048576*
*socket.send.buffer.bytes=1048576*
*queued.max.requests=16*
*fetch.purgatory.purge.interval.requests=100*
*producer.purgatory.purge.interval.requests=100*

The following is the server.log.

*[From 10:31:20,136 to 10:31:20,158, the low showed 'Scheduling log
segment' for all segments file under partition 2]*
*[2015-09-16 10:31:20,136] INFO Scheduling log segment 323671365 for log
ubt.mobilemonitor-2 for deletion. (kafka.log.Log)*
*[2015-09-16 10:31:20,143] INFO Scheduling log segment 324603130 for log
ubt.mobilemonitor-2 for deletion. (kafka.log.Log)*
*[2015-09-16 10:31:20,143] INFO Scheduling log segment 325534373 for log
ubt.mobilemonitor-2 for deletion. (kafka.log.Log)*
*.....*
*[2015-09-16 10:31:20,158] INFO Scheduling log segment 452156179 for log
ubt.mobilemonitor-2 for deletion. (kafka.log.Log)*
*[2015-09-16 10:31:20,158] INFO Scheduling log segment 453167436 for log
ubt.mobilemonitor-2 for deletion. (kafka.log.Log)*

*[Broker found the offset is not in sync with master anymore, it shrinked
the ISR]*
*[2015-09-16 10:31:20,192] WARN [ReplicaFetcherThread-1-1088254], Replica
10811396 for partition [ubt.mobilemonitor,2] reset its fetch offset from
323671365 to current leader 1088254's start offset 323671365
(kafka.server.ReplicaFetcherThread)*
*[2015-09-16 10:31:20,192] ERROR [ReplicaFetcherThread-1-1088254], Current
offset 453840204 for partition [ubt.mobilemonitor,2] out of range; reset
offset to 323671365 (kafka.server.ReplicaFetcherThread)*
*[2015-09-16 10:31:22,367] INFO Partition [ubt.mobilemonitor,2] on broker
1088254: Shrinking ISR for partition [ubt.mobilemonitor,2] from
1088254,10811396 to 1088254 (kafka.cluster.Partition)*

*[Producer was still writing data, and segments were deleted at the same
time]*
*[2015-09-16 10:31:52,502] INFO Rolled new log segment for
'ubt.mobilemonitor-2' in 1 ms. (kafka.log.Log)*
*[2015-09-16 10:32:20,143] INFO Deleting segment 323671365 from log
ubt.mobilemonitor-2. (kafka.log.Log)*
*[2015-09-16 10:32:20,143] INFO Deleting segment 324603130 from log
ubt.mobilemonitor-2. (kafka.log.Log)*
*[2015-09-16 10:32:20,144] INFO Deleting segment 326466801 from log
ubt.mobilemonitor-2. (kafka.log.Log)*
*[2015-09-16 10:32:20,144] INFO Deleting segment 329263345 from log
ubt.mobilemonitor-2. (kafka.log.Log)*
*[2015-09-16 10:32:20,144] INFO Deleting segment 327398756 from log
ubt.mobilemonitor-2. (kafka.log.Log)*
*[2015-09-16 10:32:20,144] INFO Deleting segment 330194386 from log
ubt.mobilemonitor-2. (kafka.log.Log)*
*[2015-09-16 10:32:20,144] INFO Deleting segment 325534373 from log
ubt.mobilemonitor-2. (kafka.log.Log)*
*[2015-09-16 10:32:20,144] INFO Deleting segment 328330912 from log
ubt.mobilemonitor-2. (kafka.log.Log)*
*[2015-09-16 10:32:20,144] INFO Deleting segment 331126657 from log
ubt.mobilemonitor-2. (kafka.log.Log) *
*.....*
*[index files were deleted]*
*[2015-09-16 10:32:20,360] INFO Deleting index
/data05/kafka/ubt.mobilemonitor-2/00000000000324603130.index.deleted
(kafka.log.OffsetIndex)*
*[2015-09-16 10:32:20,448] INFO Deleting index
/data05/kafka/ubt.mobilemonitor-2/00000000000325534373.index.deleted
(kafka.log.OffsetIndex)*
*[2015-09-16 10:32:20,478] INFO Deleting index
/data05/kafka/ubt.mobilemonitor-2/00000000000326466801.index.deleted
(kafka.log.OffsetIndex)*
*.....*
*[FetchThread started to fetch replicas]*
*[2015-09-16 10:33:02,021] INFO Rolled new log segment for
'ubt.mobilemonitor-2' in 1 ms. (kafka.log.Log)*
*[2015-09-16 10:33:34,441] INFO Rolled new log segment for
'ubt.mobilemonitor-2' in 1 ms. (kafka.log.Log)*
*[2015-09-16 10:34:05,636] INFO Rolled new log segment for
'ubt.mobilemonitor-2' in 1 ms. (kafka.log.Log)*
*[2015-09-16 10:34:36,266] INFO Rolled new log segment for
'ubt.mobilemonitor-2' in 1 ms. (kafka.log.Log)*
*[2015-09-16 10:35:07,829] INFO Rolled new log segment for
'ubt.mobilemonitor-2' in 1 ms. (kafka.log.Log)*

*[From 10:33:02 to 11:46:03, all segments were recovered, and ISR expended]*
*[2015-09-16 11:46:03,679] INFO Partition [ubt.mobilemonitor,2] on broker
1088254: Expanding ISR for partition [ubt.mobilemonitor,2] from 1088254 to
1088254,10811396 (kafka.cluster.Partition) *


-- 
With Warm regards

Yiming Liu (刘一鸣)