Posted to users@kafka.apache.org by 조용준 <yj...@youhost.co.kr> on 2023/05/04 04:50:00 UTC

Kafka broker I/O exceptions on AKS

I am getting IOExceptions on a Kafka broker that I installed myself on AKS.

Environment:

- The AKS PVC storage is Azure StorageV2 (General Purpose V2).
- 3 brokers and 1 ZooKeeper node (2 more planned).
- The log and data path is /mnt (not /tmp).
- The broker version is Apache kafka_2.13-3.3.2.
- No Kafka Connect or Kafka Streams; only consumers and producers.
- No Kerberos or any other authentication (including SSL).

Only one broker has the problem. The other two brokers are fine.

What I have tried:

- Rebooting the broker.
- Reinstalling the broker.
- Clearing and reconfiguring the Kafka brokers and ZooKeeper.
- Changing the data and log path from /tmp to /mnt.

When I asked Azure support, they said there were no issues with the storage and that availability was 100%. They suggested that the problem may be related to a concurrency issue in Kafka.

What can I do, or where should I look next? If you need any additional information, please let me know.

My config:

############################# Socket Server Settings #############################
listeners=...
advertised.listeners=...
listener.security.protocol.map=...
inter.broker.listener.name=...
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600

############################# Log Basics #############################
log.dirs=/mnt/kafka-logs/
num.partitions=1
num.recovery.threads.per.data.dir=1

############################# Internal Topic Settings #############################
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=3
delete.topic.enable=true
allow.auto.create.topics=false
offset.retention.minutes=4320

############################# Log Flush Policy #############################

############################# Log Retention Policy #############################
log.retention.hours=48
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000

############################# Zookeeper #############################
zookeeper.connect=zk-cluster.zookeeper4kafka.svc.cluster.local:2181
zookeeper.connection.timeout.ms=18000

############################# Group Coordinator Settings #############################
group.initial.rebalance.delay.ms=0

My logs:

[2023-04-30 16:19:11,582] ERROR Error while reading checkpoint file /mnt/kafka-logs/cleaner-offset-checkpoint (kafka.server.LogDirFailureChannel)
java.nio.file.FileSystemException: /mnt/kafka-logs/cleaner-offset-checkpoint: Resource temporarily unavailable
    at java.base/sun.nio.fs.UnixException.translateToIOException(Unknown Source)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
    at java.base/sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source)
    at java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(Unknown Source)
    at java.base/java.nio.file.Files.newByteChannel(Unknown Source)
    at java.base/java.nio.file.Files.newByteChannel(Unknown Source)
    at java.base/java.nio.file.spi.FileSystemProvider.newInputStream(Unknown Source)
    at java.base/java.nio.file.Files.newInputStream(Unknown Source)
    at java.base/java.nio.file.Files.newBufferedReader(Unknown Source)
    at java.base/java.nio.file.Files.newBufferedReader(Unknown Source)
    at org.apache.kafka.server.common.CheckpointFile.read(CheckpointFile.java:104)
    at kafka.server.checkpoints.CheckpointFileWithFailureHandler.read(CheckpointFileWithFailureHandler.scala:48)
    at kafka.server.checkpoints.OffsetCheckpointFile.read(OffsetCheckpointFile.scala:70)
    at kafka.log.LogCleanerManager.$anonfun$allCleanerCheckpoints$2(LogCleanerManager.scala:136)
    at scala.collection.Iterator$$anon$10.nextCur(Iterator.scala:587)
[2023-04-30 16:19:37,901] WARN [ReplicaManager broker=2] Stopping serving replicas in dir /mnt/kafka-logs (kafka.server.ReplicaManager)
[2023-04-30 16:19:37,906] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions HashSet(topic_partitions_names) (kafka.server.ReplicaFetcherManager)
[2023-04-30 16:19:37,907] INFO [ReplicaAlterLogDirsManager on broker 2] Removed fetcher for partitions HashSet(topic_partitions_names) (kafka.server.ReplicaAlterLogDirsManager)
[2023-04-30 16:19:37,919] WARN [ReplicaManager broker=2] Broker 2 stopped fetcher for partitions topic_partitions_names and stopped moving logs for partitions because they are in the failed log directory /mnt/kafka-logs. (kafka.server.ReplicaManager)
[2023-04-30 16:19:37,919] WARN Stopping serving logs in dir /mnt/kafka-logs (kafka.log.LogManager)
[2023-04-30 16:19:37,928] ERROR Shutdown broker because all log dirs in /mnt/kafka-logs have failed (kafka.log.LogManager)

[2023-04-30 19:47:17,548] INFO [LocalLog partition=__consumer_offsets-31, dir=/mnt/kafka-logs] Rolled new log segment at offset 43468272 in 142 ms. (kafka.log.LocalLog)
[2023-04-30 19:47:17,602] INFO [ProducerStateManager partition=__consumer_offsets-31] Wrote producer snapshot at offset 43468272 with 0 producer ids in 20 ms. (kafka.log.ProducerStateManager)
[2023-04-30 19:47:17,620] ERROR Error while flushing log for __consumer_offsets-31 in dir /mnt/kafka-logs with offset 43468272 (exclusive) and recovery point 43468272 (kafka.server.LogDirFailureChannel)
java.io.IOException: Bad file descriptor
    at java.base/java.nio.MappedByteBuffer.force0(Native Method)
    at java.base/java.nio.MappedByteBuffer.force(Unknown Source)
    at kafka.log.AbstractIndex.$anonfun$flush$1(AbstractIndex.scala:219)
    at kafka.log.AbstractIndex.flush(AbstractIndex.scala:219)
    at kafka.log.LogSegment.$anonfun$flush$1(LogSegment.scala:469)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
    at kafka.log.LogSegment.flush(LogSegment.scala:467)
[2023-04-30 19:47:17,621] WARN [ReplicaManager broker=2] Stopping serving replicas in dir /mnt/kafka-logs (kafka.server.ReplicaManager)
[2023-04-30 19:47:17,621] ERROR Uncaught exception in scheduled task 'flush-log' (kafka.utils.KafkaScheduler)
org.apache.kafka.common.errors.KafkaStorageException: Error while flushing log for __consumer_offsets-31 in dir /mnt/kafka-logs with offset 43468272 (exclusive) and recovery point 43468272
Caused by: java.io.IOException: Bad file descriptor
    at java.base/java.nio.MappedByteBuffer.force0(Native Method)
    at java.base/java.nio.MappedByteBuffer.force(Unknown Source)
    at kafka.log.AbstractIndex.$anonfun$flush$1(AbstractIndex.scala:219)
    at kafka.log.AbstractIndex.flush(AbstractIndex.scala:219)
    at kafka.log.LogSegment.$anonfun$flush$1(LogSegment.scala:469)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
    at kafka.log.LogSegment.flush(LogSegment.scala:467)
[2023-04-30 19:47:17,703] ERROR [ReplicaFetcher replicaId=2, leaderId=0, fetcherId=0] Error while processing data for partition __consumer_offsets-31 at offset 43468272 (kafka.server.ReplicaFetcherThread)
org.apache.kafka.common.errors.KafkaStorageException: The log dir /mnt/kafka-logs is already offline due to a previous IO exception.
[2023-04-30 19:47:17,704] WARN [ReplicaFetcher replicaId=2, leaderId=0, fetcherId=0] Partition __consumer_offsets-31 marked as failed (kafka.server.ReplicaFetcherThread)
[2023-04-30 19:47:17,704] ERROR [ReplicaFetcher replicaId=2, leaderId=0, fetcherId=0] Error while processing data for partition topic-name-1 at offset 1449760 (kafka.server.ReplicaFetcherThread)
org.apache.kafka.common.errors.KafkaStorageException: The log dir /mnt/kafka-logs is already offline due to a previous IO exception.
[2023-04-30 19:47:17,704] WARN [ReplicaFetcher replicaId=2, leaderId=0, fetcherId=0] Partition topic-name-1 marked as failed (kafka.server.ReplicaFetcherThread)
[2023-04-30 19:47:17,705] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions HashSet(topic_partitions_names...) (kafka.server.ReplicaFetcherManager)
[2023-04-30 19:47:17,706] INFO [ReplicaAlterLogDirsManager on broker 2] Removed fetcher for partitions HashSet(topic_partitions_names....) (kafka.server.ReplicaAlterLogDirsManager)
[2023-04-30 19:47:17,720] WARN [ReplicaManager broker=2] Broker 2 stopped fetcher for partitions topic_partitions_names... and stopped moving logs for partitions because they are in the failed log directory /mnt/kafka-logs. (kafka.server.ReplicaManager)
[2023-04-30 19:47:17,720] WARN Stopping serving logs in dir /mnt/kafka-logs (kafka.log.LogManager)
[2023-04-30 19:47:17,722] ERROR Shutdown broker because all log dirs in /mnt/kafka-logs have failed (kafka.log.LogManager)

Thanks.
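
P.S. In both incidents above, the first ERROR comes from kafka.server.LogDirFailureChannel, and everything after it (fetchers removed, broker shutdown) follows from that. As a rough aid for searching a longer broker log, here is a minimal Python sketch that keeps only those entries. It is an illustration, not part of Kafka: the function name log_dir_failures and the SAMPLE lines are hypothetical, and it assumes the timestamp layout seen in the log above, with one entry header per line.

```python
import re
from typing import Iterable, List, Tuple

# Matches the header of a broker log entry such as
# "[2023-04-30 16:19:11,582] ERROR Error while ... (kafka.server.LogDirFailureChannel)".
# Extra whitespace inside the brackets is tolerated.
ENTRY = re.compile(
    r"^\[\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s*,\s*(\d{3})\s*\]\s+"
    r"(\w+)\s+(.*)$"
)

def log_dir_failures(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (timestamp, message) for each ERROR logged by LogDirFailureChannel."""
    hits = []
    for line in lines:
        m = ENTRY.match(line.rstrip("\n"))
        if not m:
            continue  # stack-trace frames and exception lines have no header
        date, millis, level, message = m.groups()
        if level == "ERROR" and "(kafka.server.LogDirFailureChannel)" in message:
            hits.append((f"{date},{millis}", message))
    return hits

# A few lines copied from the log above, used only as sample input.
SAMPLE = [
    "[2023-04-30 16:19:11,582] ERROR Error while reading checkpoint file "
    "/mnt/kafka-logs/cleaner-offset-checkpoint (kafka.server.LogDirFailureChannel)",
    "java.nio.file.FileSystemException: /mnt/kafka-logs/cleaner-offset-checkpoint: "
    "Resource temporarily unavailable",
    "[2023-04-30 16:19:37,928] ERROR Shutdown broker because all log dirs in "
    "/mnt/kafka-logs have failed (kafka.log.LogManager)",
    "[2023-04-30 19:47:17,620] ERROR Error while flushing log for __consumer_offsets-31 "
    "in dir /mnt/kafka-logs with offset 43468272 (exclusive) and recovery point 43468272 "
    "(kafka.server.LogDirFailureChannel)",
    "[2023-04-30 19:47:17,621] WARN [ReplicaManager broker=2] Stopping serving replicas "
    "in dir /mnt/kafka-logs (kafka.server.ReplicaManager)",
]

for timestamp, message in log_dir_failures(SAMPLE):
    print(timestamp, message)
```

With the sample lines, only the 16:19:11,582 checkpoint read failure and the 19:47:17,620 flush failure are printed. To scan a real file, pass an open file object instead of SAMPLE.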