Posted to jira@kafka.apache.org by "Prateek Agarwal (Jira)" <ji...@apache.org> on 2020/12/06 07:31:00 UTC

[jira] [Comment Edited] (KAFKA-6689) Kafka not release .deleted file.

    [ https://issues.apache.org/jira/browse/KAFKA-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17244673#comment-17244673 ] 

Prateek Agarwal edited comment on KAFKA-6689 at 12/6/20, 7:30 AM:
------------------------------------------------------------------

We are observing the same pattern after upgrading to JDK 11. (We have also added the {{jdk.unsupported}} module and verified that it is being invoked correctly.)

We see a gradual rise in open FDs, mostly because of an accumulation of deleted index and timeindex files.

The surprising thing is that the FD count remained flat when we were on JDK 8, so it might be related to how the Java 9+ [unsafe.invokeCleaner|https://github.com/apache/kafka/blame/trunk/clients/src/main/java/org/apache/kafka/common/utils/ByteBufferUnmapper.java#L12] path is behaving.
{code:java}
...
java    126846 kafka  DEL    REG               8,17            98784300039 /var/kafka-spool/data/foo/00000000000059913719.index.deleted
java    126846 kafka  DEL    REG               8,17            64424811556 /var/kafka-spool/data/foo/00000000000059913710.timeindex.deleted 
...{code}
These files show up in {{lsof}} but are no longer physically present on disk.
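For context, here is a minimal sketch of what the Java 9+ unmap path effectively does (via {{sun.misc.Unsafe.invokeCleaner}}, which is why {{jdk.unsupported}} is needed). This is a simplified illustration, not Kafka's actual {{ByteBufferUnmapper}} code; the file name and mapping size are placeholders:
{code:java}
import java.io.RandomAccessFile;
import java.lang.reflect.Field;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class UnmapSketch {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("/tmp/example.index", "rw");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            // ... read/write the mapped index ...
            unmap(buffer); // without this, the mapping is only released once the buffer is GC'd
        }
    }

    // Java 9+ path: requires the jdk.unsupported module for sun.misc.Unsafe
    private static void unmap(MappedByteBuffer buffer) throws Exception {
        Field f = Class.forName("sun.misc.Unsafe").getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Object unsafe = f.get(null);
        unsafe.getClass().getMethod("invokeCleaner", java.nio.ByteBuffer.class)
              .invoke(unsafe, buffer);
    }
}
{code}
If that call were being skipped or failing silently for the deleted segments, the mappings (and hence the {{DEL}} entries above) would linger until GC, which would match what we observe.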

The accumulated open FDs suddenly drop when a concurrent (mixed) GC kicks in, as shown below:

!Screenshot 2020-12-06 at 12.53.20 PM.png|width=468,height=158!

I saw an explanation for this in [this mail chain|http://mail-archives.apache.org/mod_mbox/kafka-users/201503.mbox/%3C54F40DEA.4020009@gmail.com%3E], but what looks buggy is that it started happening only after the JDK 11 deployment.

[~ijuma] Have you seen any such observations during your testing of {{Unmap}} for Java 11? Is it expected that a few of the ".[time]index.deleted" file descriptors might remain dangling until a mixed GC kicks in on JDK 11?
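To illustrate the behaviour I suspect, here is a small standalone demo (the file path and sleep durations are arbitrary placeholders, and {{lsof}} has to be run from another shell): a mapping to a deleted file keeps the {{DEL}} entry alive until the buffer becomes unreachable and a GC cycle lets its cleaner run.
{code:java}
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class DanglingMappingDemo {
    public static void main(String[] args) throws Exception {
        File file = new File("/tmp/demo.timeindex");
        MappedByteBuffer buffer;
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
             FileChannel channel = raf.getChannel()) {
            buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
        }
        file.delete();
        // At this point `lsof -p <pid>` still shows a DEL entry for the file:
        // the mapping pins the deleted inode even though the FD itself is closed.
        Thread.sleep(30_000);

        buffer = null;   // drop the last reference ...
        System.gc();     // ... and only after a GC cycle does the cleaner munmap it
        Thread.sleep(30_000);
    }
}
{code}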



> Kafka not release .deleted file.
> --------------------------------
>
>                 Key: KAFKA-6689
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6689
>             Project: Kafka
>          Issue Type: Bug
>          Components: config, controller, log
>    Affects Versions: 0.10.1.1
>         Environment: 3 Kafka Broker running on CentOS6.9(rungi3 VMs) 
>            Reporter: A
>            Priority: Critical
>              Labels: newbie
>             Fix For: 0.10.1.1
>
>         Attachments: Screenshot 2020-12-06 at 12.53.20 PM.png
>
>
>          After Kafka cleaned log  .timeindex / .index files based on topic retention. I can
> still lsof a lot of .index.deleted and .timeindex.deleted files. 
>        We have 3 Brokers on 3 VMs ,  It's happen only 2 brokers.  
> [brokeer-01 ~]$ lsof -p 24324 | grep deleted | wc -l
> 28
> [broker-02 ~]$ lsof -p 12131 | grep deleted | wc -l
> 14599
> [broker-03 ~]$ lsof -p 3349 | grep deleted | wc -l
> 14523
>  
> Configuration on 3 broker is same.  (Rolllog every hour, Retention time 11 Hours)
>  * INFO KafkaConfig values:
>  advertised.host.name = null
>  advertised.listeners = PLAINTEXT://Broker-02:9092
>  advertised.port = null
>  authorizer.class.name =
>  auto.create.topics.enable = true
>  auto.leader.rebalance.enable = true
>  background.threads = 10
>  broker.id = 2
>  broker.id.generation.enable = true
>  broker.rack = null
>  compression.type = producer
>  connections.max.idle.ms = 600000
>  controlled.shutdown.enable = true
>  controlled.shutdown.max.retries = 3
>  controlled.shutdown.retry.backoff.ms = 5000
>  controller.socket.timeout.ms = 30000
>  default.replication.factor = 3
>  delete.topic.enable = true
>  fetch.purgatory.purge.interval.requests = 1000
>  group.max.session.timeout.ms = 300000
>  group.min.session.timeout.ms = 6000
>  host.name =
>  inter.broker.protocol.version = 0.10.1-IV2
>  leader.imbalance.check.interval.seconds = 300
>  leader.imbalance.per.broker.percentage = 10
>  listeners = null
>  log.cleaner.backoff.ms = 15000
>  log.cleaner.dedupe.buffer.size = 134217728
>  log.cleaner.delete.retention.ms = 86400000
>  log.cleaner.enable = true
>  log.cleaner.io.buffer.load.factor = 0.9
>  log.cleaner.io.buffer.size = 524288
>  log.cleaner.io.max.bytes.per.second = 1.7976931348623157E308
>  log.cleaner.min.cleanable.ratio = 0.5
>  log.cleaner.min.compaction.lag.ms = 0
>  log.cleaner.threads = 1
>  log.cleanup.policy = [delete]
>  log.dir = /tmp/kafka-logs
>  log.dirs = /data/appdata/kafka/data
>  log.flush.interval.messages = 9223372036854775807
>  log.flush.interval.ms = null
>  log.flush.offset.checkpoint.interval.ms = 60000
>  log.flush.scheduler.interval.ms = 9223372036854775807
>  log.index.interval.bytes = 4096
>  log.index.size.max.bytes = 10485760
>  log.message.format.version = 0.10.1-IV2
>  log.message.timestamp.difference.max.ms = 9223372036854775807
>  log.message.timestamp.type = CreateTime
>  log.preallocate = false
>  log.retention.bytes = -1
>  log.retention.check.interval.ms = 300000
>  log.retention.hours = 11
>  log.retention.minutes = 660
>  log.retention.ms = 39600000
>  log.roll.hours = 1
>  log.roll.jitter.hours = 0
>  log.roll.jitter.ms = null
>  log.roll.ms = null
>  log.segment.bytes = 1073741824
>  log.segment.delete.delay.ms = 60000
>  max.connections.per.ip = 2147483647
>  max.connections.per.ip.overrides =
>  message.max.bytes = 1000012
>  metric.reporters = []
>  metrics.num.samples = 2
>  metrics.sample.window.ms = 30000
>  min.insync.replicas = 2
>  num.io.threads = 16
>  num.network.threads = 16
>  num.partitions = 10
>  num.recovery.threads.per.data.dir = 3
>  num.replica.fetchers = 1
>  offset.metadata.max.bytes = 4096
>  offsets.commit.required.acks = -1
>  offsets.commit.timeout.ms = 5000
>  offsets.load.buffer.size = 5242880
>  offsets.retention.check.interval.ms = 600000
>  offsets.retention.minutes = 1440
>  offsets.topic.compression.codec = 0
>  offsets.topic.num.partitions = 50
>  offsets.topic.replication.factor = 3
>  offsets.topic.segment.bytes = 104857600
>  port = 9092
>  principal.builder.class = class org.apache.kafka.common.security.auth.DefaultPrincipalBuilder
>  producer.purgatory.purge.interval.requests = 1000
>  queued.max.requests = 30000
>  quota.consumer.default = 9223372036854775807
>  quota.producer.default = 9223372036854775807
>  quota.window.num = 11
>  quota.window.size.seconds = 1
>  replica.fetch.backoff.ms = 1000
>  replica.fetch.max.bytes = 1048576
>  replica.fetch.min.bytes = 1
>  replica.fetch.response.max.bytes = 10485760
>  replica.fetch.wait.max.ms = 500
>  replica.high.watermark.checkpoint.interval.ms = 5000
>  replica.lag.time.max.ms = 10000
>  replica.socket.receive.buffer.bytes = 65536
>  replica.socket.timeout.ms = 30000
>  replication.quota.window.num = 11
>  replication.quota.window.size.seconds = 1
>  request.timeout.ms = 30000
>  reserved.broker.max.id = 1000
>  sasl.enabled.mechanisms = [GSSAPI]
>  sasl.kerberos.kinit.cmd = /usr/bin/kinit
>  sasl.kerberos.min.time.before.relogin = 60000
>  sasl.kerberos.principal.to.local.rules = [DEFAULT]
>  sasl.kerberos.service.name = null
>  sasl.kerberos.ticket.renew.jitter = 0.05
>  sasl.kerberos.ticket.renew.window.factor = 0.8
>  sasl.mechanism.inter.broker.protocol = GSSAPI
>  security.inter.broker.protocol = PLAINTEXT
>  socket.receive.buffer.bytes = 102400
>  socket.request.max.bytes = 104857600
>  socket.send.buffer.bytes = 102400
>  ssl.cipher.suites = null
>  ssl.client.auth = none
>  ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
>  ssl.endpoint.identification.algorithm = null
>  ssl.key.password = null
>  ssl.keymanager.algorithm = SunX509
>  ssl.keystore.location = null
>  ssl.keystore.password = null
>  ssl.keystore.type = JKS
>  ssl.protocol = TLS
>  ssl.provider = null
>  ssl.secure.random.implementation = null
>  ssl.trustmanager.algorithm = PKIX
>  ssl.truststore.location = null
>  ssl.truststore.password = null
>  ssl.truststore.type = JKS
>  unclean.leader.election.enable = true
>  zookeeper.connect = 10.95.148.119:2181,10.95.148.120:2181,10.95.148.121:2181
>  zookeeper.connection.timeout.ms = 6000
>  zookeeper.session.timeout.ms = 6000
>  zookeeper.set.acl = false
>  zookeeper.sync.time.ms = 2000
>  (kafka.server.KafkaConfig)
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)