Posted to jira@kafka.apache.org by "Hanlin Liu (JIRA)" <ji...@apache.org> on 2017/06/23 10:42:00 UTC

[jira] [Commented] (KAFKA-3039) Temporary loss of leader resulted in log being completely truncated

    [ https://issues.apache.org/jira/browse/KAFKA-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060711#comment-16060711 ] 

Hanlin Liu commented on KAFKA-3039:
-----------------------------------

We are seeing a similar issue when restarting Kafka 0.10.2.0. On restart, broker 1 recovered __consumer_offsets-22 to log end offset 75825, but on the subsequent become-follower transition it found no checkpointed high watermark and truncated the log to offset 0:

```
[2017-06-23 05:50:35,082] WARN Found a corrupted index file due to requirement failed: Corrupt index found, index file (/var/lib/kafka/data/__consumer_offsets-22/00000000000000074220.index) has non-zero size but the last offset is 74220 which is no larger than the base offset 74220.}. deleting /var/lib/kafka/data/__consumer_offsets-22/00000000000000074220.timeindex, /var/lib/kafka/data/__consumer_offsets-22/00000000000000074220.index and rebuilding index... (kafka.log.Log)
[2017-06-23 05:50:35,091] INFO Recovering unflushed segment 74220 in log __consumer_offsets-22. (kafka.log.Log)
[2017-06-23 05:50:35,095] INFO Completed load of log __consumer_offsets-22 with 2 log segments and log end offset 75825 in 35 ms (kafka.log.Log)
[2017-06-23 06:14:57,133] TRACE Controller 1 epoch 1 sending UpdateMetadata request (Leader:2,ISR:2,3,1,LeaderEpoch:0,ControllerEpoch:1) to brokers Set(1, 2, 3) for partition __consumer_offsets-22 (state.change.logger)
[2017-06-23 06:14:57,482] TRACE Broker 1 handling LeaderAndIsr request correlationId 7 from controller 1 epoch 1 starting the become-follower transition for partition __consumer_offsets-22 (state.change.logger)
[2017-06-23 06:14:57,587] INFO Partition [__consumer_offsets,22] on broker 1: No checkpointed highwatermark is found for partition __consumer_offsets-22 (kafka.cluster.Partition)
[2017-06-23 06:14:57,606] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions __consumer_offsets-28,__consumer_offsets-10,__consumer_offsets-32,__consumer_offsets-14,__consumer_offsets-40,__consumer_offsets-37,__consumer_offsets-22,__consumer_offsets-41,__consumer_offsets-4,__consumer_offsets-23,__consumer_offsets-26,__consumer_offsets-8,__consumer_offsets-49,__consumer_offsets-31,__consumer_offsets-13,__consumer_offsets-35,__consumer_offsets-17,__consumer_offsets-43,__consumer_offsets-25,__consumer_offsets-44,__consumer_offsets-47,__consumer_offsets-7,__consumer_offsets-29,__consumer_offsets-11,__consumer_offsets-34,__consumer_offsets-19,__consumer_offsets-16,__consumer_offsets-38,__consumer_offsets-1,__consumer_offsets-20,__consumer_offsets-5,__consumer_offsets-46,__consumer_offsets-2 (kafka.server.ReplicaFetcherManager)
[2017-06-23 06:14:57,606] TRACE Broker 1 stopped fetchers as part of become-follower request from controller 1 epoch 1 with correlation id 7 for partition __consumer_offsets-22 (state.change.logger)
[2017-06-23 06:14:57,608] INFO The cleaning for partition __consumer_offsets-22 is aborted and paused (kafka.log.LogCleaner)
[2017-06-23 06:14:57,608] INFO Truncating log __consumer_offsets-22 to offset 0. (kafka.log.Log)
[2017-06-23 06:14:57,609] INFO Scheduling log segment 42456 for log __consumer_offsets-22 for deletion. (kafka.log.Log)
[2017-06-23 06:14:57,609] INFO Scheduling log segment 74220 for log __consumer_offsets-22 for deletion. (kafka.log.Log)
[2017-06-23 06:14:57,611] INFO Compaction for partition __consumer_offsets-22 is resumed (kafka.log.LogCleaner)
[2017-06-23 06:14:57,904] TRACE Broker 1 truncated logs and checkpointed recovery boundaries for partition __consumer_offsets-22 as part of become-follower request with correlation id 7 from controller 1 epoch 1 (state.change.logger)
[2017-06-23 06:14:57,910] INFO [ReplicaFetcherManager on broker 1] Added fetcher for partitions List([__consumer_offsets-22, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-8, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-4, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-7, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-46, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-25, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-35, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-41, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-23, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-49, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-47, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-16, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-28, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-31, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-37, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-38, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-17, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-19, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-11, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-13, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-2, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-43, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-14, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-20, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-44, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-1, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-5, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-26, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-29, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-34, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-10, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-32, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-40, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] ) (kafka.server.ReplicaFetcherManager)
[2017-06-23 06:14:57,910] TRACE Broker 1 started fetcher to new leader as part of become-follower request from controller 1 epoch 1 with correlation id 7 for partition __consumer_offsets-22 (state.change.logger)
[2017-06-23 06:14:57,911] TRACE Broker 1 completed LeaderAndIsr request correlationId 7 from controller 1 epoch 1 for the become-follower transition for partition __consumer_offsets-22 (state.change.logger)
[2017-06-23 06:14:57,922] TRACE Broker 1 cached leader info (LeaderAndIsrInfo:(Leader:2,ISR:2,3,1,LeaderEpoch:0,ControllerEpoch:1),ReplicationFactor:3),AllReplicas:1,2,3) for partition __consumer_offsets-22 in response to UpdateMetadata request sent by controller 1 epoch 1 with correlation id 8 (state.change.logger)
[2017-06-23 06:15:57,610] INFO Deleting segment 42456 from log __consumer_offsets-22. (kafka.log.Log)
[2017-06-23 06:15:57,610] INFO Deleting segment 74220 from log __consumer_offsets-22. (kafka.log.Log)
[2017-06-23 06:15:57,612] INFO Deleting index /var/lib/kafka/data/__consumer_offsets-22/00000000000000042456.index.deleted (kafka.log.OffsetIndex)
[2017-06-23 06:15:57,613] INFO Deleting index /var/lib/kafka/data/__consumer_offsets-22/00000000000000042456.timeindex.deleted (kafka.log.TimeIndex)
[2017-06-23 06:15:58,311] INFO Deleting index /var/lib/kafka/data/__consumer_offsets-22/00000000000000074220.index.deleted (kafka.log.OffsetIndex)
[2017-06-23 06:15:58,912] INFO Deleting index /var/lib/kafka/data/__consumer_offsets-22/00000000000000074220.timeindex.deleted (kafka.log.TimeIndex)
```
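
If I read the 0.10.x code correctly, the truncation to 0 follows from the become-follower path: the follower truncates its log to its checkpointed high watermark, and a partition missing from replication-offset-checkpoint gets a high watermark of 0. A minimal Scala sketch of that path (simplified, hypothetical names, not the actual broker code):

```scala
// Sketch of the 0.10.x become-follower truncation, under the assumption that
// the broker truncates the local log to its checkpointed high watermark
// before fetching from the new leader. Names here are illustrative only.
object BecomeFollowerSketch {
  def truncationPoint(partition: String,
                      checkpoint: Map[String, Long], // replication-offset-checkpoint
                      logEndOffset: Long): Long =
    checkpoint.get(partition) match {
      case Some(hw) => math.min(hw, logEndOffset)
      case None =>
        // "No checkpointed highwatermark is found for partition ..." -> hw = 0,
        // hence "Truncating log __consumer_offsets-22 to offset 0."
        0L
    }

  def main(args: Array[String]): Unit = {
    // __consumer_offsets-22 recovered to LEO 75825 but had no checkpoint entry:
    println(truncationPoint("__consumer_offsets-22", Map.empty, 75825L)) // prints 0
  }
}
```

If that reading is right, this looks like the same class of problem that KIP-101 (0.11.0.0) addresses by truncating based on leader epochs rather than the high watermark.

Broker config from the same startup, for reference: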
```
INFO KafkaConfig values: 
	advertised.host.name = null
	advertised.listeners = PLAINTEXT://192.168.1.214:9092
	advertised.port = null
	authorizer.class.name = 
	auto.create.topics.enable = true
	auto.leader.rebalance.enable = true
	background.threads = 10
	broker.id = 1
	broker.id.generation.enable = true
	broker.rack = null
	compression.type = producer
	connections.max.idle.ms = 600000
	controlled.shutdown.enable = true
	controlled.shutdown.max.retries = 3
	controlled.shutdown.retry.backoff.ms = 5000
	controller.socket.timeout.ms = 30000
	create.topic.policy.class.name = null
	default.replication.factor = 3
	delete.topic.enable = false
	fetch.purgatory.purge.interval.requests = 1000
	group.max.session.timeout.ms = 300000
	group.min.session.timeout.ms = 6000
	host.name = 
	inter.broker.listener.name = null
	inter.broker.protocol.version = 0.10.2-IV0
	leader.imbalance.check.interval.seconds = 300
	leader.imbalance.per.broker.percentage = 10
	listener.security.protocol.map = SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,TRACE:TRACE,SASL_SSL:SASL_SSL,PLAINTEXT:PLAINTEXT
	listeners = PLAINTEXT://0.0.0.0:9092
	log.cleaner.backoff.ms = 15000
	log.cleaner.dedupe.buffer.size = 134217728
	log.cleaner.delete.retention.ms = 86400000
	log.cleaner.enable = true
	log.cleaner.io.buffer.load.factor = 0.9
	log.cleaner.io.buffer.size = 524288
	log.cleaner.io.max.bytes.per.second = 1.7976931348623157E308
	log.cleaner.min.cleanable.ratio = 0.5
	log.cleaner.min.compaction.lag.ms = 0
	log.cleaner.threads = 1
	log.cleanup.policy = [delete]
	log.dir = /tmp/kafka-logs
	log.dirs = /var/lib/kafka/data
	log.flush.interval.messages = 9223372036854775807
	log.flush.interval.ms = null
	log.flush.offset.checkpoint.interval.ms = 60000
	log.flush.scheduler.interval.ms = 9223372036854775807
	log.index.interval.bytes = 4096
	log.index.size.max.bytes = 10485760
	log.message.format.version = 0.10.2-IV0
	log.message.timestamp.difference.max.ms = 9223372036854775807
	log.message.timestamp.type = CreateTime
	log.preallocate = false
	log.retention.bytes = 10000000000
	log.retention.check.interval.ms = 300000
	log.retention.hours = 168
	log.retention.minutes = null
	log.retention.ms = null
	log.roll.hours = 168
	log.roll.jitter.hours = 0
	log.roll.jitter.ms = null
	log.roll.ms = null
	log.segment.bytes = 1073741824
	log.segment.delete.delay.ms = 60000
	max.connections.per.ip = 2147483647
	max.connections.per.ip.overrides = 
	message.max.bytes = 2000000
	metric.reporters = []
	metrics.num.samples = 2
	metrics.recording.level = INFO
	metrics.sample.window.ms = 30000
	min.insync.replicas = 1
	num.io.threads = 8
	num.network.threads = 3
	num.partitions = 1
	num.recovery.threads.per.data.dir = 1
	num.replica.fetchers = 1
	offset.metadata.max.bytes = 4096
	offsets.commit.required.acks = -1
	offsets.commit.timeout.ms = 5000
	offsets.load.buffer.size = 5242880
	offsets.retention.check.interval.ms = 600000
	offsets.retention.minutes = 1440
	offsets.topic.compression.codec = 0
	offsets.topic.num.partitions = 50
	offsets.topic.replication.factor = 3
	offsets.topic.segment.bytes = 104857600
	port = 9092
	principal.builder.class = class org.apache.kafka.common.security.auth.DefaultPrincipalBuilder
	producer.purgatory.purge.interval.requests = 1000
	queued.max.requests = 500
	quota.consumer.default = 9223372036854775807
	quota.producer.default = 9223372036854775807
	quota.window.num = 11
	quota.window.size.seconds = 1
	replica.fetch.backoff.ms = 1000
	replica.fetch.max.bytes = 3000000
	replica.fetch.min.bytes = 1
	replica.fetch.response.max.bytes = 10485760
	replica.fetch.wait.max.ms = 500
	replica.high.watermark.checkpoint.interval.ms = 5000
	replica.lag.time.max.ms = 10000
	replica.socket.receive.buffer.bytes = 65536
	replica.socket.timeout.ms = 30000
	replication.quota.window.num = 11
	replication.quota.window.size.seconds = 1
	request.timeout.ms = 30000
	reserved.broker.max.id = 1000
	sasl.enabled.mechanisms = [GSSAPI]
	sasl.kerberos.kinit.cmd = /usr/bin/kinit
	sasl.kerberos.min.time.before.relogin = 60000
	sasl.kerberos.principal.to.local.rules = [DEFAULT]
	sasl.kerberos.service.name = null
	sasl.kerberos.ticket.renew.jitter = 0.05
	sasl.kerberos.ticket.renew.window.factor = 0.8
	sasl.mechanism.inter.broker.protocol = GSSAPI
	security.inter.broker.protocol = PLAINTEXT
	socket.receive.buffer.bytes = 102400
	socket.request.max.bytes = 104857600
	socket.send.buffer.bytes = 102400
	ssl.cipher.suites = null
	ssl.client.auth = none
	ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
	ssl.endpoint.identification.algorithm = null
	ssl.key.password = null
	ssl.keymanager.algorithm = SunX509
	ssl.keystore.location = null
	ssl.keystore.password = null
	ssl.keystore.type = JKS
	ssl.protocol = TLS
	ssl.provider = null
	ssl.secure.random.implementation = null
	ssl.trustmanager.algorithm = PKIX
	ssl.truststore.location = null
	ssl.truststore.password = null
	ssl.truststore.type = JKS
	unclean.leader.election.enable = true
	zookeeper.connect = zk1:2181,zk2:2181,zk3:2181
	zookeeper.connection.timeout.ms = null
	zookeeper.session.timeout.ms = 6000
	zookeeper.set.acl = false
	zookeeper.sync.time.ms = 2000
```
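
Compared with the original report's settings quoted below, a few values from this dump stand out (a summary of the dump above, not a recommended fix):

```
unclean.leader.election.enable=true               # 0.10.x default; the original reporter ran with false
min.insync.replicas=1                             # a lone in-sync replica can satisfy acks=-1 commits
log.flush.interval.messages=9223372036854775807   # flushing left to the OS, hence the unflushed segment on restart
```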

> Temporary loss of leader resulted in log being completely truncated
> -------------------------------------------------------------------
>
>                 Key: KAFKA-3039
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3039
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.9.0.0
>         Environment: Debian 3.2.54-2 x86_64 GNU/Linux
>            Reporter: Imran Patel
>            Priority: Critical
>              Labels: reliability
>
> We had an event recently where the temporary loss of a leader for a partition (during a manual restart) resulted in the leader coming back with no high watermark state and truncating its log to zero. Logs (attached below) indicate that it did have the data but not the commit state. How is this possible?
> Leader (broker 3)
> [2015-12-18 21:19:44,666] INFO Completed load of log messages-14 with log end offset 14175963374 (kafka.log.Log)
> [2015-12-18 21:19:45,170] INFO Partition [messages,14] on broker 3: No checkpointed highwatermark is found for partition [messages,14] (kafka.cluster.Partition)
> [2015-12-18 21:19:45,238] INFO Truncating log messages-14 to offset 0. (kafka.log.Log)
> [2015-12-18 21:20:34,066] INFO Partition [messages,14] on broker 3: Expanding ISR for partition [messages,14] from 3 to 3,10 (kafka.cluster.Partition)
> Replica (broker 10)
> [2015-12-18 21:19:19,525] INFO Partition [messages,14] on broker 10: Shrinking ISR for partition [messages,14] from 3,10,4 to 10,4 (kafka.cluster.Partition)
> [2015-12-18 21:20:34,049] ERROR [ReplicaFetcherThread-0-3], Current offset 14175984203 for partition [messages,14] out of range; reset offset to 35977 (kafka.server.ReplicaFetcherThread)
> [2015-12-18 21:20:34,033] WARN [ReplicaFetcherThread-0-3], Replica 10 for partition [messages,14] reset its fetch offset from 14175984203 to current leader 3's latest offset 35977 (kafka.server.ReplicaFetcherThread)
> Some relevant config parameters:
>         offsets.topic.replication.factor = 3
>         offsets.commit.required.acks = -1
>         replica.high.watermark.checkpoint.interval.ms = 5000
>         unclean.leader.election.enable = false


