Posted to jira@kafka.apache.org by GitBox <gi...@apache.org> on 2020/11/03 13:33:12 UTC

[GitHub] [kafka] junrao commented on a change in pull request #5346: KAFKA-6432: make index lookup more cache friendly

junrao commented on a change in pull request #5346:
URL: https://github.com/apache/kafka/pull/5346#discussion_r516177126



##########
File path: core/src/main/scala/kafka/log/AbstractIndex.scala
##########
@@ -44,9 +44,67 @@ abstract class AbstractIndex[K, V](@volatile var file: File, val baseOffset: Lon
   // Length of the index file
   @volatile
   private var _length: Long = _
-
   protected def entrySize: Int
 
+  /*
+   Kafka mmaps index files into memory, and all read / write operations on the index go through the OS page cache. This
+   avoids blocking disk I/O in most cases.
+
+   To the best of our knowledge, all modern operating systems use an LRU policy (or a variant of it) to manage the page
+   cache. Kafka always appends to the end of the index file, and almost all index lookups (typically from in-sync
+   followers or consumers) land very close to the end of the index. So, the LRU cache replacement policy should work
+   very well with Kafka's index access pattern.
+
+   However, when looking up an index entry, the standard binary search algorithm is not cache friendly and can cause
+   unnecessary page faults (the thread is blocked waiting for index entries to be read from disk, because those entries
+   are not cached in the page cache).
+
+   For example, in an index with 13 pages, looking up an entry in the last page (page #12) with the standard binary
+   search algorithm reads index entries in pages #0, 6, 9, 11, and 12.
+   page number: |0|1|2|3|4|5|6|7|8|9|10|11|12 |
+   steps:       |1| | | | | |3| | |4|  |5 |2/6|
+   Each page holds hundreds of index entries, corresponding to hundreds to thousands of Kafka messages. While the
+   index gradually grows from the 1st entry in page #12 to the last entry in page #12, all the write (append)
+   operations are in page #12, and every in-sync follower / consumer lookup reads pages #0, 6, 9, 11, and 12. As these
+   pages are used in every in-sync lookup, we can assume they were recently used and are very likely to be in the page
+   cache. When the index grows into page #13, the pages needed in an in-sync lookup change to #0, 7, 10, 12,
+   and 13:
+   page number: |0|1|2|3|4|5|6|7|8|9|10|11|12|13 |
+   steps:       |1| | | | | | |3| | | 4|5 | 6|2/7|
+   Page #7 and page #10 have not been used for a very long time, so they are much less likely to be in the page cache
+   than the other pages. The 1st lookup after the 1st index entry in page #13 is appended will likely have to read
+   page #7 and page #10 from disk (page faults), which can take more than a second. In our tests, this caused the
+   at-least-once produce latency to jump from a few milliseconds to about 1 second.
+
+   Here, we use a more cache-friendly lookup algorithm:
+   if (target > indexEntry[end - N]) // if the target is in the last N entries of the index
+      binarySearch(end - N, end)
+   else
+      binarySearch(begin, end - N)
+
+   If possible, we only look up in the last N entries of the index. By choosing a proper constant N, all the in-sync
+   lookups should go to the 1st branch. We call the last N entries the "warm" section. As we frequently look up in this
+   relatively small section, the pages containing this section are more likely to be in the page cache.
+
+   We set N (_warmEntries) to 8192, because
+   1. This number is small enough to guarantee that all pages of the "warm" section are touched in every warm-section
+      lookup, so the entire warm section really is "warm".
+      In a warm-section lookup, the following 3 entries are always touched: indexEntry(end), indexEntry(end - N),
+      and indexEntry((end*2 - N)/2). If the page size is >= 4096, touching those 3 entries touches all the warm-section
+      pages (3 or fewer). As of 2018, 4096 is the smallest page size for all modern processors (x86-32, x86-64, MIPS,
+      SPARC, Power, ARM, etc.).
+   2. This number is large enough to guarantee that most in-sync lookups happen in the warm section. With default
+      Kafka settings, an 8KB index corresponds to about 4MB (offset index) or 2.7MB (time index) of log messages.

Review comment:
       An offset index has 8 bytes per entry, so an 8KB offset index has 1K entries. Since, by default, we add an index entry for every 4KB of messages, an 8KB offset index corresponds to 1K * 4KB = 4MB of messages.
   
   A time index has 12 bytes per entry, so the same calculation gives about 2.7MB of messages.
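
To make the page-walk numbers in the diff above concrete, here is a minimal, self-contained Scala simulation (illustrative only, not Kafka code). It models an index with 4KB pages and 8-byte entries, probes both ends of the index, then binary-searches for a target on the last page while recording which pages are touched:

    object PageTouchSim extends App {
      val entriesPerPage = 4096 / 8          // 4KB pages, 8-byte offset-index entries
      val pages          = 13
      val numEntries     = pages * entriesPerPage

      // Record pages in the order they are first touched.
      val touched = scala.collection.mutable.LinkedHashSet.empty[Int]
      def read(slot: Int): Unit = touched += slot / entriesPerPage

      // Entry values are strictly increasing, so the simulation can compare
      // slot numbers directly. The target sits on the last page.
      val target = numEntries - 1
      read(0)                                // boundary check: first entry
      read(numEntries - 1)                   // boundary check: last entry
      var lo = 0
      var hi = numEntries - 1
      while (lo < hi) {
        val mid = (lo + hi + 1) >>> 1
        read(mid)
        if (mid <= target) lo = mid else hi = mid - 1
      }
      println(touched.mkString("pages touched: ", ", ", ""))
      // prints: pages touched: 0, 12, 6, 9, 11
    }

Rerunning it with pages = 14 touches pages 0, 13, 7, 10, 12, showing the cold pages #7 and #10 entering the access path, as in the second table above.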
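Here is a similar sketch of the two-branch lookup itself, as standalone Scala over a plain sorted Array[Long] rather than the real mmap-backed index; lookup, warmEntries, and binarySearch are illustrative names, not the actual AbstractIndex API:

    object WarmSectionLookup {
      // Illustrative stand-in for _warmEntries: an 8192-byte warm section
      // holds 8192 / 8 = 1024 offset-index entries.
      val warmEntries: Int = 8192 / 8

      // Largest slot in entries(begin..end) whose value is <= target.
      // Assumes entries is sorted and entries(begin) <= target.
      private def binarySearch(entries: Array[Long], target: Long,
                               begin: Int, end: Int): Int = {
        var lo = begin
        var hi = end
        while (lo < hi) {
          val mid = (lo + hi + 1) >>> 1     // round up so the loop terminates
          if (entries(mid) <= target) lo = mid else hi = mid - 1
        }
        lo
      }

      // Check the "warm" tail first; fall back to the cold range only
      // when the target is older than the warm section.
      // Assumes target >= entries(0).
      def lookup(entries: Array[Long], target: Long): Int = {
        val end       = entries.length - 1
        val firstWarm = math.max(0, end - warmEntries)
        if (target > entries(firstWarm))    // target is in the last N entries
          binarySearch(entries, target, firstWarm, end)
        else
          binarySearch(entries, target, 0, firstWarm)
      }
    }

In the warm branch, every call probes entries(firstWarm) plus slots inside the warm tail only, so steady-state in-sync lookups never touch the cold middle of the index.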
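Finally, a quick Scala check of the sizing arithmetic in the review comment, assuming the default log.index.interval.bytes of 4096 and the entry widths quoted above (8 bytes for the offset index, 12 for the time index):

    object IndexSizing extends App {
      val warmBytes      = 8192       // size of the warm section, in bytes
      val indexIntervalB = 4096       // one index entry per 4KB of messages

      // Messages covered = (warm entries) * (bytes of messages per entry).
      def coveredMB(entryBytes: Int): Double =
        (warmBytes / entryBytes).toLong * indexIntervalB / (1024.0 * 1024.0)

      println(f"offset index (8B/entry): ${coveredMB(8)}%.1f MB")   // 4.0 MB
      println(f"time index  (12B/entry): ${coveredMB(12)}%.1f MB")  // 2.7 MB
    }

Both numbers match the figures quoted in the comment block and the review comment above.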



