Posted to jira@kafka.apache.org by "wang (JIRA)" <ji...@apache.org> on 2018/07/11 03:41:00 UTC

[jira] [Created] (KAFKA-7148) Kafka log loading is very slow after the broker goes down with an OutOfMemoryError

wang created KAFKA-7148:
---------------------------

             Summary: Kafka log loading is very slow after the broker goes down with an OutOfMemoryError
                 Key: KAFKA-7148
                 URL: https://issues.apache.org/jira/browse/KAFKA-7148
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 0.10.1.1
            Reporter: wang


Two questions:

1. Is there any way to avoid the "IOException: Map failed"? Is there some limit on VM memory size that we are hitting?

2. Is it normal for Kafka to take 20+ seconds to load a single partition's log?
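For reference, the error can be reproduced outside Kafka with a minimal standalone program (my own sketch, not Kafka code; the class name is made up). On Linux, FileChannel.map() fails with java.lang.OutOfMemoryError: Map failed once the process exhausts its per-process mapping budget (vm.max_map_count) or mappable address space, regardless of how large or small -Xmx is:

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

// Hypothetical repro class, not part of Kafka.
public class MapFailedRepro {
    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("mmap-repro", ".bin");
        f.deleteOnExit();
        List<MappedByteBuffer> maps = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
            raf.setLength(4096);
            FileChannel ch = raf.getChannel();
            while (true) {
                // Each map() call adds one entry to /proc/<pid>/maps; the
                // mappings are never released, so the per-process limit
                // (vm.max_map_count) is eventually exhausted.
                maps.add(ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096));
            }
        } catch (OutOfMemoryError e) {
            // On Linux this surfaces as "OutOfMemoryError: Map failed",
            // the same error as in the broker log below.
            System.out.println("Failed after " + maps.size() + " mappings: " + e);
        }
    }
}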

 

Details:

1. Linux version:
 kafka_2.11-0.10.1.1> cat /etc/SuSE-release
 SUSE Linux Enterprise Server 11 (x86_64)
 VERSION = 11
 PATCHLEVEL = 3
 
2. VM info: 4 cores, 32 GB RAM (4C32G)

3. java -version
 java version "1.8.0_131"
 Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
 Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

 

4. Start command:

java -Xmx16G -Xms16G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+DisableExplicitGC -Djava.awt.headless=true -Xloggc:/opt/***/kafka_2.11-0.10.1.1/bin/../logs/kafkaServer-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -Dkafka.logs.dir=/opt/***/kafka_2.11-0.10.1.1/bin/../logs -Dlog4j.configuration=file:./../config/log4j.properties -cp :………………………………


5. Brokers & topics

We have 3 brokers and 3 ZooKeeper nodes, with 4 topics in this Kafka cluster:
 __consumer_offsets: 50 partitions, replication factor 3
 topic1: 5 partitions, replication factor 3
 topic2: 160 partitions, replication factor 3
 topic3: 160 partitions, replication factor 3

Total data disk usage: 32G
 du -sh data/
 32G data/
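For rough scale (my own arithmetic from the numbers above, assuming replicas are spread evenly): (50 + 5 + 160 + 160) partitions x 3 replicas / 3 brokers = 375 partition replicas per broker. Each log segment mmaps an offset index (plus a time index in this version), so active segments alone account for roughly 750 mappings per broker, which is well below the default vm.max_map_count of 65536 and consistent with the 1195 entries counted in /proc/<pid>/maps in step 7 below.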


6. Logs

[2018-07-10 17:23:59,728] FATAL [Replica Manager on Broker 1]: Halting due to unrecoverable I/O error while handling produce request: (kafka.server.ReplicaManager)
kafka.common.KafkaStorageException: I/O exception in append to log '**************-13'
 at kafka.log.Log.append(Log.scala:349)
 at kafka.cluster.Partition$$anonfun$10.apply(Partition.scala:443)
 at kafka.cluster.Partition$$anonfun$10.apply(Partition.scala:429)
 at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:234)
 at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:240)
 at kafka.cluster.Partition.appendMessagesToLeader(Partition.scala:429)
 at kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:407)
 at kafka.server.ReplicaManager$$anonfun$appendToLocalLog$2.apply(ReplicaManager.scala:393)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
 at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
 at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
 at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
 at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
 at scala.collection.AbstractTraversable.map(Traversable.scala:104)
 at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:393)
 at kafka.server.ReplicaManager.appendMessages(ReplicaManager.scala:330)
 at kafka.server.KafkaApis.handleProducerRequest(KafkaApis.scala:436)
 at kafka.server.KafkaApis.handle(KafkaApis.scala:78)
 at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Map failed
 at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:940)
 at kafka.log.AbstractIndex.<init>(AbstractIndex.scala:61)
 at kafka.log.OffsetIndex.<init>(OffsetIndex.scala:52)
 at kafka.log.LogSegment.<init>(LogSegment.scala:67)
 at kafka.log.Log.roll(Log.scala:778)
 at kafka.log.Log.maybeRoll(Log.scala:744)
 at kafka.log.Log.append(Log.scala:405)
 ... 22 more
Caused by: java.lang.OutOfMemoryError: Map failed
 at sun.nio.ch.FileChannelImpl.map0(Native Method)
 at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:937)
 ... 28 more

 

7. Then I followed KAFKA-6165 (https://issues.apache.org/jira/browse/KAFKA-6165?jql=project%20%3D%20KAFKA%20AND%20text%20~%20%22sun.nio.ch.FileChannelImpl.map%20Map%20failed%22):

I increased vm.max_map_count from the default value (65536) to 262144:
 
 /sbin/sysctl -a |grep map 
 vm.max_map_count = 262144
 
 cat /proc/2860/maps |wc -l
 1195

I also edited kafka-run-class.sh to replace `-XX:+DisableExplicitGC` with `-XX:+ExplicitGCInvokesConcurrent`.
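For context on why the GC flag matters here: in OpenJDK 8, sun.nio.ch.FileChannelImpl.map() handles a failed mmap by forcing a GC and retrying once, roughly like this (abbreviated paraphrase of the JDK 8 source, not a runnable snippet):

try {
    addr = map0(imode, mapPosition, mapSize);
} catch (OutOfMemoryError x) {
    // A failed map may just mean unreferenced MappedByteBuffers are
    // waiting to be collected, so force a GC and retry once.
    System.gc();                      // a no-op under -XX:+DisableExplicitGC
    try {
        Thread.sleep(100);
    } catch (InterruptedException y) {
        Thread.currentThread().interrupt();
    }
    try {
        addr = map0(imode, mapPosition, mapSize);
    } catch (OutOfMemoryError y) {
        throw new IOException("Map failed", y);  // the exception in the log above
    }
}

With -XX:+DisableExplicitGC that System.gc() call does nothing, so stale index mappings are never reclaimed before the retry; -XX:+ExplicitGCInvokesConcurrent keeps the call effective without triggering a stop-the-world full GC, which is why KAFKA-6165 recommends the swap.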


8. But the "IOException: Map failed" problem still exists.

 We then upgraded the VM to 4C64G and changed -Xmx16G -Xms16G to -Xmx4G -Xms4G.


9. Slow log loading
 
 [2018-07-10 17:35:33,481] INFO Completed load of log ***-37 with 2 log segments and log end offset 2441365 in 15678 ms (kafka.log.Log)
 [2018-07-10 17:35:33,484] WARN Found a corrupted index file due to requirement failed: Corrupt index found, index file (/opt/***/data/***-34/00000000000002451611.index) has non-zero size but the last offset is 2451611 which is no larger than the base offset 2451611.}. deleting /opt/***/data/***-34/00000000000002451611.timeindex, /opt/***/data/***-34/00000000000002451611.index and rebuilding index... (kafka.log.Log)
 [2018-07-10 17:35:33,485] INFO Recovering unflushed segment 1086746 in log ***-34. (kafka.log.Log)
 [2018-07-10 17:35:49,070] INFO Recovering unflushed segment 2451611 in log ***-34. (kafka.log.Log)
 [2018-07-10 17:35:49,071] INFO Completed load of log ***-34 with 2 log segments and log end offset 2451613 in 15589 ms (kafka.log.Log)
 [2018-07-10 17:35:49,073] WARN Found a corrupted index file due to requirement failed: Corrupt index found, index file (/opt/***/data/***-31/00000000000002437998.index) has non-zero size but the last offset is 2437998 which is no larger than the base offset 2437998.}. deleting /opt/***/data/***-31/00000000000002437998.timeindex, /opt/***/data/***-31/00000000000002437998.index and rebuilding index... (kafka.log.Log)
 [2018-07-10 17:35:49,090] INFO Recovering unflushed segment 1080802 in log ***-31. (kafka.log.Log)
 [2018-07-10 17:36:04,806] INFO Recovering unflushed segment 2437998 in log ***-31. (kafka.log.Log)
 [2018-07-10 17:36:04,807] INFO Completed load of log ***-31 with 2 log segments and log end offset 2438000 in 15735 ms (kafka.log.Log)
 [2018-07-10 17:36:04,810] WARN Found a corrupted index file due to requirement failed: Corrupt index found, index file (/opt/***/data/***-28/00000000000002432077.index) has non-zero size but the last offset is 2432077 which is no larger than the base offset 2432077.}. deleting /opt/***/data/***-28/00000000000002432077.timeindex, /opt/***/data/***-28/00000000000002432077.index and rebuilding index... (kafka.log.Log)
 [2018-07-10 17:36:04,811] INFO Recovering unflushed segment 1081522 in log ***-28. (kafka.log.Log)
 [2018-07-10 17:36:19,537] INFO Recovering unflushed segment 2432077 in log ***-28. (kafka.log.Log)
 [2018-07-10 17:36:19,538] INFO Completed load of log ***-28 with 2 log segments and log end offset 2432079 in 14730 ms (kafka.log.Log)
 [2018-07-10 17:36:19,539] WARN Found a corrupted index file due to requirement failed: Corrupt index found, index file (/opt/***/data/***-25/00000000000001080046.index) has non-zero size but the last offset is 1080046 which is no larger than the base offset 1080046.}. deleting /opt/***/data/***-25/00000000000001080046.timeindex, /opt/***/data/***-25/00000000000001080046.index and rebuilding index... (kafka.log.Log)
 [2018-07-10 17:36:34,232] INFO Recovering unflushed segment 1080046 in log ***-25. (kafka.log.Log)
 [2018-07-10 17:36:48,261] INFO Completed load of log ***-25 with 1 log segments and log end offset 2435166 in 28721 ms (kafka.log.Log)
 [2018-07-10 17:36:48,266] WARN Found a corrupted index file due to requirement failed: Corrupt index found, index file (/opt/***/data/***-22/00000000000002438178.index) has non-zero size but the last offset is 2438178 which is no larger than the base offset 2438178.}. deleting /opt/***/data/***-22/00000000000002438178.timeindex, /opt/***/data/***-22/00000000000002438178.index and rebuilding index... (kafka.log.Log)
 [2018-07-10 17:36:48,267] INFO Recovering unflushed segment 1081500 in log ***-22. (kafka.log.Log)
 [2018-07-10 17:37:02,803] INFO Recovering unflushed segment 2438178 in log ***-22. (kafka.log.Log)
 [2018-07-10 17:37:02,803] INFO Completed load of log ***-22 with 2 log segments and log end offset 2438180 in 14540 ms (kafka.log.Log)
 [2018-07-10 17:37:02,807] WARN Found a corrupted index file due to requirement failed: Corrupt index found, index file (/opt/***/data/***-19/00000000000001081412.index) has non-zero size but the last offset is 1081412 which is no larger than the base offset 1081412.}. deleting /opt/***/data/***-19/00000000000001081412.timeindex, /opt/***/data/***-19/00000000000001081412.index and rebuilding index... (kafka.log.Log)
 [2018-07-10 17:37:17,536] INFO Recovering unflushed segment 1081412 in log ***-19. (kafka.log.Log)
 [2018-07-10 17:37:31,570] INFO Completed load of log ***-19 with 1 log segments and log end offset 2443668 in 28764 ms (kafka.log.Log)
 [2018-07-10 17:37:31,573] WARN Found a corrupted index file due to requirement failed: Corrupt index found, index file (/opt/***/data/***-35/00000000000002440999.index) has non-zero size but the last offset is 2440999 which is no larger than the base offset 2440999.}. deleting /opt/***/data/***-35/00000000000002440999.timeindex, /opt/***/data/***-35/00000000000002440999.index and rebuilding index... (kafka.log.Log)
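Each unflushed segment takes a fairly uniform 14-16 seconds to recover. Some rough arithmetic (my own, based on the numbers above): 32 GB of data spread over roughly 375 partition replicas is on the order of 90 MB per partition, so ~15 s per segment is only a few MB/s. Combined with the single busy recovery thread in the dump below ("pool-3-thread-1"), this suggests startup time is dominated by one thread re-reading and CRC-checking every unflushed segment in sequence; num.recovery.threads.per.data.dir (default 1) controls how many such threads run per data directory.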


10. Then I checked the CPU: it was at 100% while loading logs. Using top -H -p and jstack, I found the busy thread:

"pool-3-thread-1" #20 prio=5 os_prio=0 tid=0x00007f7041537000 nid=0x5c3a runnable [0x00007f703042e000]
   java.lang.Thread.State: RUNNABLE
 at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
 at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
 at sun.nio.ch.IOUtil.read(IOUtil.java:197)
 at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:741)
 at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:727)
 at kafka.log.FileMessageSet$$anon$1.makeNext(FileMessageSet.scala:319)
 at kafka.log.FileMessageSet$$anon$1.makeNext(FileMessageSet.scala:308)
 at kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:64)
 at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:56)
 at kafka.log.LogSegment.recover(LogSegment.scala:211)
 at kafka.log.Log$$anonfun$loadSegments$4.apply(Log.scala:218)
 at kafka.log.Log$$anonfun$loadSegments$4.apply(Log.scala:179)
 at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
 at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
 at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
 at kafka.log.Log.loadSegments(Log.scala:179)
 at kafka.log.Log.<init>(Log.scala:108)
 at kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$3$$anonfun$apply$10$$anonfun$apply$1.apply$mcV$sp(LogManager.scala:151)
 at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:58)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:748)
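The trace shows LogSegment.recover() rebuilding indexes after the unclean shutdown: it re-reads every message in each unflushed segment, validates its CRC, and re-inserts sparse index entries, so the cost is linear in segment size regardless of how little data was actually corrupt. Reduced to a self-contained illustration (my own sketch of the shape of that loop, not Kafka's actual code; the record format and all names are made up):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Illustration only: recover a segment file of [length|crc|payload] records,
// rebuilding a sparse offset -> file-position index as Kafka does.
public class SegmentRecoverySketch {
    static final int INDEX_INTERVAL_BYTES = 4096; // analogous to index.interval.bytes

    // Returns (offset, filePosition) index entries; truncates trailing garbage.
    public static List<long[]> recover(RandomAccessFile segment) throws IOException {
        List<long[]> index = new ArrayList<>();
        long position = 0, offset = 0, lastIndexed = 0;
        CRC32 crc = new CRC32();
        while (position + 8 <= segment.length()) {
            segment.seek(position);
            int len = segment.readInt();
            long storedCrc = Integer.toUnsignedLong(segment.readInt());
            if (len < 0 || position + 8 + len > segment.length())
                break;                            // incomplete final record
            byte[] payload = new byte[len];
            segment.readFully(payload);           // the pread() loop seen in the jstack
            crc.reset();
            crc.update(payload);
            if (crc.getValue() != storedCrc)
                break;                            // stop at the first bad record
            if (position - lastIndexed > INDEX_INTERVAL_BYTES) {
                index.add(new long[] {offset, position}); // sparse index entry
                lastIndexed = position;
            }
            position += 8 + len;
            offset++;
        }
        segment.setLength(position);              // drop anything after the last valid record
        return index;
    }
}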



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)