You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by "Ted Yu (JIRA)" <ji...@apache.org> on 2018/01/05 03:08:00 UTC

[jira] [Commented] (KAFKA-6423) Slow shutdown with many open files

    [ https://issues.apache.org/jira/browse/KAFKA-6423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312423#comment-16312423 ] 

Ted Yu commented on KAFKA-6423:
-------------------------------

From FileRecords.java :
{code}
    public void close() throws IOException {
        flush();
        trim();
        channel.close();
{code}
Did you notice flush() appearing in any stack trace you captured ?

> Slow shutdown with many open files
> ----------------------------------
>
>                 Key: KAFKA-6423
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6423
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.11.0.1
>            Reporter: Ivan Babrou
>
> We have brokers with 20k open files and shutdown is extremely slow, progressing usually at around 60 closed file descriptors per second:
> {noformat}
> $ while true; do echo $(date) $(sudo ls /proc/6363/fd | wc -l); sleep 1; done
> Thu Jan 4 23:00:51 UTC 2018 9770
> Thu Jan 4 23:00:52 UTC 2018 9691
> Thu Jan 4 23:00:53 UTC 2018 9616
> Thu Jan 4 23:00:54 UTC 2018 9561
> Thu Jan 4 23:00:55 UTC 2018 9509
> Thu Jan 4 23:00:56 UTC 2018 9427
> Thu Jan 4 23:00:57 UTC 2018 9350
> Thu Jan 4 23:00:58 UTC 2018 9260
> Thu Jan 4 23:00:59 UTC 2018 9208
> {noformat}
> If you strace the process, you can see:
> {noformat}
> $ sudo strace -f -c -p 6363
> strace: Process 6363 attached with 97 threads
> ^Cstrace: Process 6363 detached
> strace: Process 6604 detached
> strace: Process 6605 detached
> strace: Process 6606 detached
> strace: Process 6607 detached
> strace: Process 6608 detached
> strace: Process 6609 detached
> strace: Process 6610 detached
> strace: Process 6611 detached
> strace: Process 6612 detached
> strace: Process 6613 detached
> strace: Process 6614 detached
> strace: Process 6615 detached
> strace: Process 6616 detached
> strace: Process 6617 detached
> strace: Process 6618 detached
> strace: Process 6619 detached
> strace: Process 6620 detached
> strace: Process 6621 detached
> strace: Process 6622 detached
> strace: Process 6623 detached
> strace: Process 6624 detached
> strace: Process 6625 detached
> strace: Process 6626 detached
> strace: Process 6627 detached
> strace: Process 6628 detached
> strace: Process 6629 detached
> strace: Process 6630 detached
> strace: Process 6631 detached
> strace: Process 6632 detached
> strace: Process 6633 detached
> strace: Process 6634 detached
> strace: Process 6635 detached
> strace: Process 6636 detached
> strace: Process 6637 detached
> strace: Process 6638 detached
> strace: Process 6639 detached
> strace: Process 6640 detached
> strace: Process 6641 detached
> strace: Process 6642 detached
> strace: Process 6643 detached
> strace: Process 6644 detached
> strace: Process 6645 detached
> strace: Process 6646 detached
> strace: Process 6647 detached
> strace: Process 6648 detached
> strace: Process 6649 detached
> strace: Process 6650 detached
> strace: Process 6651 detached
> strace: Process 6652 detached
> strace: Process 6653 detached
> strace: Process 6654 detached
> strace: Process 6655 detached
> strace: Process 6656 detached
> strace: Process 6657 detached
> strace: Process 6658 detached
> strace: Process 6659 detached
> strace: Process 6660 detached
> strace: Process 6661 detached
> strace: Process 6662 detached
> strace: Process 6663 detached
> strace: Process 6716 detached
> strace: Process 6717 detached
> strace: Process 6718 detached
> strace: Process 6719 detached
> strace: Process 6720 detached
> strace: Process 6721 detached
> strace: Process 6722 detached
> strace: Process 6723 detached
> strace: Process 6724 detached
> strace: Process 6725 detached
> strace: Process 6726 detached
> strace: Process 6727 detached
> strace: Process 6728 detached
> strace: Process 6729 detached
> strace: Process 6730 detached
> strace: Process 6731 detached
> strace: Process 6732 detached
> strace: Process 6733 detached
> strace: Process 6734 detached
> strace: Process 6735 detached
> strace: Process 6736 detached
> strace: Process 6737 detached
> strace: Process 6738 detached
> strace: Process 6739 detached
> strace: Process 6740 detached
> strace: Process 6741 detached
> strace: Process 6760 detached
> strace: Process 6779 detached
> strace: Process 6781 detached
> strace: Process 6783 detached
> strace: Process 6892 detached
> strace: Process 2339 detached
> strace: Process 2340 detached
> strace: Process 5122 detached
> strace: Process 5123 detached
> strace: Process 5652 detached
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  65.19    0.859302       35804        24        19 restart_syscall
>  26.60    0.350656         507       692       190 futex
>   5.17    0.068142        2271        30           epoll_wait
>   1.22    0.016141          56       287           ftruncate
>   0.66    0.008679          20       432           close
>   0.38    0.005054          35       144           fsync
>   0.26    0.003489          12       288           open
>   0.19    0.002564           4       720           fstat
>   0.15    0.001914           3       576           lseek
>   0.14    0.001885          13       144           mmap
>   0.01    0.000114           7        16           getrusage
>   0.00    0.000060          15         4           write
>   0.00    0.000028           5         6           read
>   0.00    0.000027           2        12           mprotect
>   0.00    0.000019           5         4           epoll_ctl
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    1.318074                  3379       209 total
> ivan@36s116:~$ htop
> ivan@36s116:~$ sudo strace -f -c -p 6363
> strace: Process 6363 attached with 97 threads
> ^Cstrace: Process 6363 detached
> strace: Process 6604 detached
> strace: Process 6605 detached
> strace: Process 6606 detached
> strace: Process 6607 detached
> strace: Process 6608 detached
> strace: Process 6609 detached
> strace: Process 6610 detached
> strace: Process 6611 detached
> strace: Process 6612 detached
> strace: Process 6613 detached
> strace: Process 6614 detached
> strace: Process 6615 detached
> strace: Process 6616 detached
> strace: Process 6617 detached
> strace: Process 6618 detached
> strace: Process 6619 detached
> strace: Process 6620 detached
> strace: Process 6621 detached
> strace: Process 6622 detached
> strace: Process 6623 detached
> strace: Process 6624 detached
> strace: Process 6625 detached
> strace: Process 6626 detached
> strace: Process 6627 detached
> strace: Process 6628 detached
> strace: Process 6629 detached
> strace: Process 6630 detached
> strace: Process 6631 detached
> strace: Process 6632 detached
> strace: Process 6633 detached
> strace: Process 6634 detached
> strace: Process 6635 detached
> strace: Process 6636 detached
> strace: Process 6637 detached
> strace: Process 6638 detached
> strace: Process 6639 detached
> strace: Process 6640 detached
> strace: Process 6641 detached
> strace: Process 6642 detached
> strace: Process 6643 detached
> strace: Process 6644 detached
> strace: Process 6645 detached
> strace: Process 6646 detached
> strace: Process 6647 detached
> strace: Process 6648 detached
> strace: Process 6649 detached
> strace: Process 6650 detached
> strace: Process 6651 detached
> strace: Process 6652 detached
> strace: Process 6653 detached
> strace: Process 6654 detached
> strace: Process 6655 detached
> strace: Process 6656 detached
> strace: Process 6657 detached
> strace: Process 6658 detached
> strace: Process 6659 detached
> strace: Process 6660 detached
> strace: Process 6661 detached
> strace: Process 6662 detached
> strace: Process 6663 detached
> strace: Process 6716 detached
> strace: Process 6717 detached
> strace: Process 6718 detached
> strace: Process 6719 detached
> strace: Process 6720 detached
> strace: Process 6721 detached
> strace: Process 6722 detached
> strace: Process 6723 detached
> strace: Process 6724 detached
> strace: Process 6725 detached
> strace: Process 6726 detached
> strace: Process 6727 detached
> strace: Process 6728 detached
> strace: Process 6729 detached
> strace: Process 6730 detached
> strace: Process 6731 detached
> strace: Process 6732 detached
> strace: Process 6733 detached
> strace: Process 6734 detached
> strace: Process 6735 detached
> strace: Process 6736 detached
> strace: Process 6737 detached
> strace: Process 6738 detached
> strace: Process 6739 detached
> strace: Process 6740 detached
> strace: Process 6741 detached
> strace: Process 6760 detached
> strace: Process 6779 detached
> strace: Process 6781 detached
> strace: Process 6783 detached
> strace: Process 6892 detached
> strace: Process 2339 detached
> strace: Process 2340 detached
> strace: Process 5122 detached
> strace: Process 5123 detached
> strace: Process 5652 detached
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  61.71   73.863265        9074      8140      3060 futex
>  32.70   39.143004          65    599436           sched_yield
>   3.02    3.610581       16337       221           epoll_wait
>   1.93    2.314673       89026        26         5 restart_syscall
>   0.18    0.211991      211991         1           accept
>   0.13    0.157244          45      3492           ftruncate
>   0.11    0.127528          24      5285           close
>   0.06    0.072592          41      1760           fsync
>   0.04    0.047197          13      3538           open
>   0.03    0.036148           4      8776           fstat
>   0.03    0.035549           6      5715           stat
>   0.02    0.026905         317        85           getdents
>   0.02    0.026150           4      6988         2 lseek
>   0.02    0.025100          14      1742           mmap
>   0.00    0.001132          15        75         2 read
>   0.00    0.000662          14        46           write
>   0.00    0.000639          23        28           msync
>   0.00    0.000580           6       100           getrusage
>   0.00    0.000443           5        95           mprotect
>   0.00    0.000212           6        34           epoll_ctl
>   0.00    0.000099          12         8           sysinfo
>   0.00    0.000050          13         4           times
>   0.00    0.000036           9         4           getrlimit
>   0.00    0.000023          12         2           getsockname
>   0.00    0.000015           8         2           uname
>   0.00    0.000013           4         3           fcntl
>   0.00    0.000013           7         2           sched_getaffinity
>   0.00    0.000007           4         2           setsockopt
>   0.00    0.000000           0         2           writev
>   0.00    0.000000           0         1           dup2
>   0.00    0.000000           0         2         1 shutdown
>   0.00    0.000000           0         2           clock_gettime
> ------ ----------- ----------- --------- --------- ----------------
> 100.00  119.701851                645617      3070 total
> {noformat}
> In jstack I can see:
> {noformat}
> "pool-6-thread-1" #219 prio=5 os_prio=0 tid=0x00007f44d80a2800 nid=0x1614 runnable [0x00007f45690a6000]
>    java.lang.Thread.State: RUNNABLE
> 	at java.io.RandomAccessFile.close0(Native Method)
> 	at java.io.RandomAccessFile.access$000(RandomAccessFile.java:59)
> 	at java.io.RandomAccessFile$1.close(RandomAccessFile.java:619)
> 	at java.io.FileDescriptor.closeAll(FileDescriptor.java:212)
> 	- locked <0x000000064c651c10> (a java.io.FileDescriptor)
> 	at java.io.RandomAccessFile.close(RandomAccessFile.java:617)
> 	at sun.nio.ch.FileChannelImpl.implCloseChannel(FileChannelImpl.java:139)
> 	at java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:115)
> 	- locked <0x000000064c651be0> (a java.lang.Object)
> 	at org.apache.kafka.common.record.FileRecords.close(FileRecords.java:171)
> 	at kafka.log.LogSegment.$anonfun$close$4(LogSegment.scala:490)
> 	at kafka.log.LogSegment$$Lambda$1176/84191269.apply$mcV$sp(Unknown Source)
> 	at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:78)
> 	at kafka.utils.Logging.swallowWarn(Logging.scala:94)
> 	at kafka.utils.Logging.swallowWarn$(Logging.scala:93)
> 	at kafka.utils.CoreUtils$.swallowWarn(CoreUtils.scala:48)
> 	at kafka.utils.Logging.swallow(Logging.scala:96)
> 	at kafka.utils.Logging.swallow$(Logging.scala:96)
> 	at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:48)
> 	at kafka.log.LogSegment.close(LogSegment.scala:490)
> 	at kafka.log.Log.$anonfun$close$2(Log.scala:537)
> 	at kafka.log.Log.$anonfun$close$2$adapted(Log.scala:537)
> 	at kafka.log.Log$$Lambda$1172/1976013682.apply(Unknown Source)
> 	at scala.collection.Iterator.foreach(Iterator.scala:929)
> 	at scala.collection.Iterator.foreach$(Iterator.scala:929)
> 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1417)
> 	at scala.collection.IterableLike.foreach(IterableLike.scala:71)
> 	at scala.collection.IterableLike.foreach$(IterableLike.scala:70)
> 	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> 	at kafka.log.Log.close(Log.scala:537)
> 	- locked <0x0000000640488d10> (a java.lang.Object)
> 	at kafka.log.LogManager.$anonfun$shutdown$7(LogManager.scala:280)
> 	at kafka.log.LogManager$$Lambda$1166/165872442.apply$mcV$sp(Unknown Source)
> 	at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> {noformat}
> CPU is barely used at ~2%.
> There is some IO, but it's very far from saturating RAID0 from 12 spinning disks:
> https://i.imgur.com/jRuW3pp.png
> This happens across brokers and clusters.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)