Posted to jira@kafka.apache.org by "Hanlin Liu (JIRA)" <ji...@apache.org> on 2017/06/23 10:42:00 UTC
[jira] [Commented] (KAFKA-3039) Temporary loss of leader resulted in log being completely truncated
[ https://issues.apache.org/jira/browse/KAFKA-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060711#comment-16060711 ]
Hanlin Liu commented on KAFKA-3039:
-----------------------------------
We are seeing a similar issue when restarting Kafka 0.10.2.0:
```
[2017-06-23 05:50:35,082] WARN Found a corrupted index file due to requirement failed: Corrupt index found, index file (/var/lib/kafka/data/__consumer_offsets-22/00000000000000074220.index) has non-zero size but the last offset is 74220 which is no larger than the base offset 74220.}. deleting /var/lib/kafka/data/__consumer_offsets-22/00000000000000074220.timeindex, /var/lib/kafka/data/__consumer_offsets-22/00000000000000074220.index and rebuilding index... (kafka.log.Log)
[2017-06-23 05:50:35,091] INFO Recovering unflushed segment 74220 in log __consumer_offsets-22. (kafka.log.Log)
[2017-06-23 05:50:35,095] INFO Completed load of log __consumer_offsets-22 with 2 log segments and log end offset 75825 in 35 ms (kafka.log.Log)
[2017-06-23 06:14:57,133] TRACE Controller 1 epoch 1 sending UpdateMetadata request (Leader:2,ISR:2,3,1,LeaderEpoch:0,ControllerEpoch:1) to brokers Set(1, 2, 3) for partition __consumer_offsets-22 (state.change.logger)
[2017-06-23 06:14:57,482] TRACE Broker 1 handling LeaderAndIsr request correlationId 7 from controller 1 epoch 1 starting the become-follower transition for partition __consumer_offsets-22 (state.change.logger)
[2017-06-23 06:14:57,587] INFO Partition [__consumer_offsets,22] on broker 1: No checkpointed highwatermark is found for partition __consumer_offsets-22 (kafka.cluster.Partition)
[2017-06-23 06:14:57,606] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions __consumer_offsets-28,__consumer_offsets-10,__consumer_offsets-32,__consumer_offsets-14,__consumer_offsets-40,__consumer_offsets-37,__consumer_offsets-22,__consumer_offsets-41,__consumer_offsets-4,__consumer_offsets-23,__consumer_offsets-26,__consumer_offsets-8,__consumer_offsets-49,__consumer_offsets-31,__consumer_offsets-13,__consumer_offsets-35,__consumer_offsets-17,__consumer_offsets-43,__consumer_offsets-25,__consumer_offsets-44,__consumer_offsets-47,__consumer_offsets-7,__consumer_offsets-29,__consumer_offsets-11,__consumer_offsets-34,__consumer_offsets-19,__consumer_offsets-16,__consumer_offsets-38,__consumer_offsets-1,__consumer_offsets-20,__consumer_offsets-5,__consumer_offsets-46,__consumer_offsets-2 (kafka.server.ReplicaFetcherManager)
[2017-06-23 06:14:57,606] TRACE Broker 1 stopped fetchers as part of become-follower request from controller 1 epoch 1 with correlation id 7 for partition __consumer_offsets-22 (state.change.logger)
[2017-06-23 06:14:57,608] INFO The cleaning for partition __consumer_offsets-22 is aborted and paused (kafka.log.LogCleaner)
[2017-06-23 06:14:57,608] INFO Truncating log __consumer_offsets-22 to offset 0. (kafka.log.Log)
[2017-06-23 06:14:57,609] INFO Scheduling log segment 42456 for log __consumer_offsets-22 for deletion. (kafka.log.Log)
[2017-06-23 06:14:57,609] INFO Scheduling log segment 74220 for log __consumer_offsets-22 for deletion. (kafka.log.Log)
[2017-06-23 06:14:57,611] INFO Compaction for partition __consumer_offsets-22 is resumed (kafka.log.LogCleaner)
[2017-06-23 06:14:57,904] TRACE Broker 1 truncated logs and checkpointed recovery boundaries for partition __consumer_offsets-22 as part of become-follower request with correlation id 7 from controller 1 epoch 1 (state.change.logger)
[2017-06-23 06:14:57,910] INFO [ReplicaFetcherManager on broker 1] Added fetcher for partitions List([__consumer_offsets-22, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-8, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-4, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-7, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-46, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-25, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-35, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-41, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-23, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-49, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-47, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-16, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-28, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-31, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-37, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-38, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-17, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-19, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-11, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-13, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-2, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-43, initOffset 0 to broker 
BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-14, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-20, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-44, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-1, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-5, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-26, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-29, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-34, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-10, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] , [__consumer_offsets-32, initOffset 0 to broker BrokerEndPoint(3,192.168.1.214,9094)] , [__consumer_offsets-40, initOffset 0 to broker BrokerEndPoint(2,192.168.1.214,9093)] ) (kafka.server.ReplicaFetcherManager)
[2017-06-23 06:14:57,910] TRACE Broker 1 started fetcher to new leader as part of become-follower request from controller 1 epoch 1 with correlation id 7 for partition __consumer_offsets-22 (state.change.logger)
[2017-06-23 06:14:57,911] TRACE Broker 1 completed LeaderAndIsr request correlationId 7 from controller 1 epoch 1 for the become-follower transition for partition __consumer_offsets-22 (state.change.logger)
[2017-06-23 06:14:57,922] TRACE Broker 1 cached leader info (LeaderAndIsrInfo:(Leader:2,ISR:2,3,1,LeaderEpoch:0,ControllerEpoch:1),ReplicationFactor:3),AllReplicas:1,2,3) for partition __consumer_offsets-22 in response to UpdateMetadata request sent by controller 1 epoch 1 with correlation id 8 (state.change.logger)
[2017-06-23 06:15:57,610] INFO Deleting segment 42456 from log __consumer_offsets-22. (kafka.log.Log)
[2017-06-23 06:15:57,610] INFO Deleting segment 74220 from log __consumer_offsets-22. (kafka.log.Log)
[2017-06-23 06:15:57,612] INFO Deleting index /var/lib/kafka/data/__consumer_offsets-22/00000000000000042456.index.deleted (kafka.log.OffsetIndex)
[2017-06-23 06:15:57,613] INFO Deleting index /var/lib/kafka/data/__consumer_offsets-22/00000000000000042456.timeindex.deleted (kafka.log.TimeIndex)
[2017-06-23 06:15:58,311] INFO Deleting index /var/lib/kafka/data/__consumer_offsets-22/00000000000000074220.index.deleted (kafka.log.OffsetIndex)
[2017-06-23 06:15:58,912] INFO Deleting index /var/lib/kafka/data/__consumer_offsets-22/00000000000000074220.timeindex.deleted (kafka.log.TimeIndex)
```
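The WARN at the top of the log fires because a non-empty index file ends at an offset no larger than the segment's base offset, so the broker deletes and rebuilds the index on startup. A minimal sketch of that sanity check (the names `sanity_check` and its parameters are illustrative, not Kafka's actual API):

```python
# Hypothetical sketch of the index sanity check behind the WARN above.
# This mirrors the condition in the log message, not Kafka's real code.

def sanity_check(base_offset: int, last_offset: int, size: int) -> None:
    """A non-empty offset index must end at an offset past its base."""
    if size > 0 and last_offset <= base_offset:
        raise ValueError(
            f"Corrupt index found: index file has non-zero size but the last "
            f"offset is {last_offset} which is no larger than the base offset "
            f"{base_offset}"
        )

# The segment above (base 74220) had a non-empty index whose last entry was
# also 74220, so the check failed and the broker rebuilt the index.
try:
    sanity_check(base_offset=74220, last_offset=74220, size=1048576)
except ValueError as e:
    print(e)
```

When the check fails, the broker deletes both the `.index` and `.timeindex` files and rebuilds them by re-reading the segment, which is what the "deleting ... and rebuilding index" message reports.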
INFO KafkaConfig values:
advertised.host.name = null
advertised.listeners = PLAINTEXT://192.168.1.214:9092
advertised.port = null
authorizer.class.name =
auto.create.topics.enable = true
auto.leader.rebalance.enable = true
background.threads = 10
broker.id = 1
broker.id.generation.enable = true
broker.rack = null
compression.type = producer
connections.max.idle.ms = 600000
controlled.shutdown.enable = true
controlled.shutdown.max.retries = 3
controlled.shutdown.retry.backoff.ms = 5000
controller.socket.timeout.ms = 30000
create.topic.policy.class.name = null
default.replication.factor = 3
delete.topic.enable = false
fetch.purgatory.purge.interval.requests = 1000
group.max.session.timeout.ms = 300000
group.min.session.timeout.ms = 6000
host.name =
inter.broker.listener.name = null
inter.broker.protocol.version = 0.10.2-IV0
leader.imbalance.check.interval.seconds = 300
leader.imbalance.per.broker.percentage = 10
listener.security.protocol.map = SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,TRACE:TRACE,SASL_SSL:SASL_SSL,PLAINTEXT:PLAINTEXT
listeners = PLAINTEXT://0.0.0.0:9092
log.cleaner.backoff.ms = 15000
log.cleaner.dedupe.buffer.size = 134217728
log.cleaner.delete.retention.ms = 86400000
log.cleaner.enable = true
log.cleaner.io.buffer.load.factor = 0.9
log.cleaner.io.buffer.size = 524288
log.cleaner.io.max.bytes.per.second = 1.7976931348623157E308
log.cleaner.min.cleanable.ratio = 0.5
log.cleaner.min.compaction.lag.ms = 0
log.cleaner.threads = 1
log.cleanup.policy = [delete]
log.dir = /tmp/kafka-logs
log.dirs = /var/lib/kafka/data
log.flush.interval.messages = 9223372036854775807
log.flush.interval.ms = null
log.flush.offset.checkpoint.interval.ms = 60000
log.flush.scheduler.interval.ms = 9223372036854775807
log.index.interval.bytes = 4096
log.index.size.max.bytes = 10485760
log.message.format.version = 0.10.2-IV0
log.message.timestamp.difference.max.ms = 9223372036854775807
log.message.timestamp.type = CreateTime
log.preallocate = false
log.retention.bytes = 10000000000
log.retention.check.interval.ms = 300000
log.retention.hours = 168
log.retention.minutes = null
log.retention.ms = null
log.roll.hours = 168
log.roll.jitter.hours = 0
log.roll.jitter.ms = null
log.roll.ms = null
log.segment.bytes = 1073741824
log.segment.delete.delay.ms = 60000
max.connections.per.ip = 2147483647
max.connections.per.ip.overrides =
message.max.bytes = 2000000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
min.insync.replicas = 1
num.io.threads = 8
num.network.threads = 3
num.partitions = 1
num.recovery.threads.per.data.dir = 1
num.replica.fetchers = 1
offset.metadata.max.bytes = 4096
offsets.commit.required.acks = -1
offsets.commit.timeout.ms = 5000
offsets.load.buffer.size = 5242880
offsets.retention.check.interval.ms = 600000
offsets.retention.minutes = 1440
offsets.topic.compression.codec = 0
offsets.topic.num.partitions = 50
offsets.topic.replication.factor = 3
offsets.topic.segment.bytes = 104857600
port = 9092
principal.builder.class = class org.apache.kafka.common.security.auth.DefaultPrincipalBuilder
producer.purgatory.purge.interval.requests = 1000
queued.max.requests = 500
quota.consumer.default = 9223372036854775807
quota.producer.default = 9223372036854775807
quota.window.num = 11
quota.window.size.seconds = 1
replica.fetch.backoff.ms = 1000
replica.fetch.max.bytes = 3000000
replica.fetch.min.bytes = 1
replica.fetch.response.max.bytes = 10485760
replica.fetch.wait.max.ms = 500
replica.high.watermark.checkpoint.interval.ms = 5000
replica.lag.time.max.ms = 10000
replica.socket.receive.buffer.bytes = 65536
replica.socket.timeout.ms = 30000
replication.quota.window.num = 11
replication.quota.window.size.seconds = 1
request.timeout.ms = 30000
reserved.broker.max.id = 1000
sasl.enabled.mechanisms = [GSSAPI]
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.principal.to.local.rules = [DEFAULT]
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.mechanism.inter.broker.protocol = GSSAPI
security.inter.broker.protocol = PLAINTEXT
socket.receive.buffer.bytes = 102400
socket.request.max.bytes = 104857600
socket.send.buffer.bytes = 102400
ssl.cipher.suites = null
ssl.client.auth = none
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.endpoint.identification.algorithm = null
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
unclean.leader.election.enable = true
zookeeper.connect = zk1:2181,zk2:2181,zk3:2181
zookeeper.connection.timeout.ms = null
zookeeper.session.timeout.ms = 6000
zookeeper.set.acl = false
zookeeper.sync.time.ms = 2000
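The truncation sequence in the log excerpt above ("No checkpointed highwatermark is found" followed by "Truncating log __consumer_offsets-22 to offset 0") appears to come from the pre-KIP-101 follower behavior: on becoming a follower, a replica truncates to its high watermark, and a missing checkpoint defaults that watermark to 0. A hedged sketch of that logic, with illustrative names (not Kafka's actual code):

```python
# Sketch, under the assumption that the broker follows the pre-KIP-101
# rule "truncate to high watermark on become-follower". Function names
# are hypothetical.

def high_watermark(checkpoints: dict, partition: str) -> int:
    # If no checkpointed high watermark exists for the partition (e.g. the
    # checkpoint file was lost or never flushed before the restart),
    # the broker falls back to 0.
    return checkpoints.get(partition, 0)

def become_follower_truncation_point(log_end_offset: int, hw: int) -> int:
    # A follower truncates its log to its high watermark before fetching
    # from the new leader.
    return min(log_end_offset, hw)

checkpoints = {}  # no entry for __consumer_offsets-22 after the restart
hw = high_watermark(checkpoints, "__consumer_offsets-22")
truncate_to = become_follower_truncation_point(log_end_offset=75825, hw=hw)
print(truncate_to)  # 0: the whole log is truncated, then re-fetched from 0
```

This matches the subsequent log lines: both segments (42456 and 74220) are scheduled for deletion and the fetcher is added with `initOffset 0`, so the partition's data is re-replicated from the leader rather than reused locally.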
> Temporary loss of leader resulted in log being completely truncated
> -------------------------------------------------------------------
>
> Key: KAFKA-3039
> URL: https://issues.apache.org/jira/browse/KAFKA-3039
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 0.9.0.0
> Environment: Debian 3.2.54-2 x86_64 GNU/Linux
> Reporter: Imran Patel
> Priority: Critical
> Labels: reliability
>
> We had an event recently where the temporary loss of a leader for a partition (during a manual restart) resulted in the leader coming back with no high watermark state and truncating its log to zero. Logs (attached below) indicate that it did have the data but not the commit state. How is this possible?
> Leader (broker 3)
> [2015-12-18 21:19:44,666] INFO Completed load of log messages-14 with log end offset 14175963374 (kafka.log.Log)
> [2015-12-18 21:19:45,170] INFO Partition [messages,14] on broker 3: No checkpointed highwatermark is found for partition [messages,14] (kafka.cluster.Partition)
> [2015-12-18 21:19:45,238] INFO Truncating log messages-14 to offset 0. (kafka.log.Log)
> [2015-12-18 21:20:34,066] INFO Partition [messages,14] on broker 3: Expanding ISR for partition [messages,14] from 3 to 3,10 (kafka.cluster.Partition)
> Replica (broker 10)
> [2015-12-18 21:19:19,525] INFO Partition [messages,14] on broker 10: Shrinking ISR for partition [messages,14] from 3,10,4 to 10,4 (kafka.cluster.Partition)
> [2015-12-18 21:20:34,049] ERROR [ReplicaFetcherThread-0-3], Current offset 14175984203 for partition [messages,14] out of range; reset offset to 35977 (kafka.server.ReplicaFetcherThread)
> [2015-12-18 21:20:34,033] WARN [ReplicaFetcherThread-0-3], Replica 10 for partition [messages,14] reset its fetch offset from 14175984203 to current leader 3's latest offset 35977 (kafka.server.ReplicaFetcherThread)
> Some relevant config parameters:
> offsets.topic.replication.factor = 3
> offsets.commit.required.acks = -1
> replica.high.watermark.checkpoint.interval.ms = 5000
> unclean.leader.election.enable = false
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)