Posted to dev@kafka.apache.org by "Jay Kreps (JIRA)" <ji...@apache.org> on 2012/11/01 00:53:11 UTC

[jira] [Commented] (KAFKA-545) Add a Performance Suite for the Log subsystem

    [ https://issues.apache.org/jira/browse/KAFKA-545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13488366#comment-13488366 ] 

Jay Kreps commented on KAFKA-545:
---------------------------------

The initial set of results show the following:

Big impacts on the write path are 
1. flush
2. FileChannel overhead
3. CRC calculation

Big impacts on the read path are
1. FileMessageSet.searchFor
2. MessageSet iteration (mostly an artifact of the test)

Notably, the index lookup doesn't show up at all.

Since our read path is an order of magnitude faster than our write path, it makes sense to focus on writes first. The most important thing here is to handle flush better.

The important thing to know is that on Linux fsync holds a global lock on writes to the file, so a flush effectively blocks all appends (even though we only really want to flush the data already written). We are primarily targeting Linux here since it is so common.

Originally we thought that we could just disable flush and depend on replication for durability, with the flush to disk happening via Linux's background pdflush process. However, recovery demands a "known stable point" from which to recover. Currently we guarantee that all but the last log segment have been flushed. The problem with this is that disabling the time- or size-based flush effectively just prolongs the time of the global lock until the segment is full, but under high load this could mean suddenly blocking on writing out 2GB of data, a huge multi-second latency spike.

There are a number of possible solutions to this dilemma:
1. Memory map the FileMessageSet. It is possible that the locking characteristics for mmap/msync are different from those of fsync. This would have the advantage of also getting rid of the overhead of the write call, which in Java is pretty high, and thus making FileMessageSet.searchFor and iteration much faster. There are two disadvantages. First, we would have to pre-allocate file segments, which would likely confuse people a bit. We are already doing it with index files, but those are smaller. Second, memory mapping eats up process address space, which would likely make running on a 32-bit OS infeasible (since you would only have 2GB of address space).
2. Change the recovery to recover from an arbitrary point in the log, and write out a "high water mark" in a similar way to how we do for the ISR. This would let us just avoid syncing the active segment.
3. Move the flush into a background thread. This wouldn't help if one were using a time-based or size-based flush, but would help for the flush at the time of segment roll, since at that point there are guaranteed to be no more writes on the segment. This would require recovering the last two segments. It is kind of hacky but easy to implement; a rough sketch is below.
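
For option 3, here is a minimal sketch of the shape it could take (the executor setup and names are illustrative, not actual Kafka code): once the active segment rolls it receives no more appends, so it can be handed to a background thread to sync without ever blocking the append path.

import java.nio.channels.FileChannel
import java.util.concurrent.Executors

// Single background thread dedicated to syncing rolled segments.
val flusher = Executors.newSingleThreadExecutor()

// Called at segment roll time; the rolled segment gets no further writes,
// so forcing it to disk asynchronously cannot stall appends to the new segment.
def flushRolledSegment(rolled: FileChannel) {
  flusher.submit(new Runnable {
    def run() { rolled.force(true) }
  })
}

Recovery would then need to consider the last two segments rather than just the last one.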

So there are two questions this raises:
1. Does mmap/msync actually have different locking behavior than fsync? Mmap might be the preferable solution since it solves lots of performance bottlenecks all at once.
2. Does fsync block reads? This is a question Jun raised. We don't actually care too much about adding a little bit of latency to consumers, but with synchronous replication fetch latency is a lower bound on produce latency, so blocking reads may be as bad as blocking writes. It is unlikely that an in-sync replica would be a full segment behind, but it is worth checking nonetheless.

I started to investigate these two questions.

I started with (2). It doesn't seem that fsync blocks reads, which is intuitive. To test this I wrote a program with one thread continually appending to a file, one thread calling fsync every 2 seconds, and one thread doing random reads. I measured the max read time over each batch of 100k reads. Here is a trace of the output:
flushing
flush completed in 0.35281
0.226049
1.347807
0.02298
1.562114
0.041638
2.119723
0.027258
5.329019
flushing
8.32156
flush completed in 81.123215
0.04145
1.473818
0.06444
1.733412
0.050216
1.437777
0.04984
1.612728
0.026001
1.858957
0.041096
0.390903
flushing
12.527883
flush completed in 67.416953
0.055656
1.450987
0.029861
1.469376
0.047733
1.313674
0.024264
1.71214
0.027112
0.023717
1.228905
0.029688
1.215998
flushing
flush completed in 59.289193
18.217726
1.549095
0.029295
1.367316
0.047124
1.389574
0.034049
0.030982
1.129182
0.048443
1.070381
0.040149
1.07179
flushing
flush completed in 59.340792
18.668898
0.933095
0.041071
1.197376
0.035512
1.34228
0.042432
0.024397
0.835786
0.026552
1.496774
0.036751
1.148597
flushing
flush completed in 59.308117
16.34416
0.854841
0.053005
1.013405
0.08081
0.051634
1.218344
0.015086
1.447114
0.019883
1.128675
0.041854
1.148591
flushing
15.110585
flush completed in 47.303732
1.018977
0.015041
0.036324
1.293796
0.051184
1.291538
0.013544
1.211112
0.014241
1.520512
0.027815
1.246593
flushing
0.016121
flush completed in 59.38031
22.635984
0.051233
0.054701
0.712837
0.01345
1.004364
0.017261
1.216081
0.019825
^C[jkreps@jkreps-ld kafka-jbod]$ java -server -Xmx128M -Xms128M -XX:+UseConcMarkSweepGC -cp project/boot/scala-2.8.0/lib/scala-library.jar:core/target/scala_2.8.0/test-classes kafka.TestFileChannelReadLocking 100000
flushing
flush completed in 0.528026
0.217155
1.620644
0.034775
0.032913
1.867401
0.026142
2.367951
0.05301
flushing
3.941371
flush completed in 252.11533
17.678379
1.226629
0.043844
2.254627
0.05084
1.654637
0.028077
0.029319
1.217127
0.029779
1.251271
0.281076
0.919776
flushing
0.050014
1.156282
flush completed in 238.852776
5.953432
0.038438
0.966883
0.048641
0.907416
0.037052
1.595778
0.023821
0.923264
0.047909
0.921312
0.058346
0.058062
flushing
0.906503
0.090477
flush completed in 239.282906
1.504453
0.014805
1.276596
0.051536
1.200947
0.052367
0.068161
1.585333
0.051904
1.052337
0.063165
1.502294
0.01266
flushing
0.839178
0.048978
flush completed in 290.738005
1.414586
0.069402
0.056503
1.0008
0.050265
0.955949
0.050486
1.014454
0.048935
1.210959
0.054599
1.313663
0.058152
flushing
0.062286
1.036941
flush completed in 242.879275
11.401024
1.390022
0.07374
0.92633
0.013332
1.015606
0.04448
1.687692
0.014552
0.018272
1.339258
0.051723
^C

As you can see, there is some locking happening, but not for the duration of the flush. I tried varying the amount of data being written, and the max read time remained constant. My guess is that the locking is at the page level, which is what we see with pdflush. This should be acceptable since the latency is bounded by the time to flush one page regardless of the flush size.
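
For reference, the harness behind the runs above (kafka.TestFileChannelReadLocking in the command line) is roughly of the following shape. This is a reconstruction from the description earlier, not the actual test source, so details such as the block size are assumed:

import java.io.{File, RandomAccessFile}
import java.nio.ByteBuffer
import java.util.Random

object FileChannelReadLockingSketch {
  def main(args: Array[String]) {
    val readsPerBatch = args(0).toInt                       // e.g. 100000
    val file = File.createTempFile("read-locking", ".dat")
    file.deleteOnExit()
    val channel = new RandomAccessFile(file, "rw").getChannel

    // Writer thread: append 4KB blocks as fast as possible.
    new Thread(new Runnable {
      def run() {
        val block = ByteBuffer.allocate(4096)
        while (true) { block.rewind(); channel.write(block, channel.size) }
      }
    }).start()

    // Flusher thread: fsync every 2 seconds and report how long it took (ms).
    new Thread(new Runnable {
      def run() {
        while (true) {
          Thread.sleep(2000)
          println("flushing")
          val start = System.nanoTime
          channel.force(true)
          println("flush completed in " + (System.nanoTime - start) / 1000000.0)
        }
      }
    }).start()

    // Reader: random reads, printing the max latency (ms) over each batch.
    val random = new Random()
    val buffer = ByteBuffer.allocate(4096)
    while (true) {
      var maxMs = 0.0
      var i = 0
      while (i < readsPerBatch) {
        val pos = math.abs(random.nextLong) % math.max(channel.size, 1L)
        buffer.clear()
        val start = System.nanoTime
        channel.read(buffer, pos)
        maxMs = math.max(maxMs, (System.nanoTime - start) / 1000000.0)
        i += 1
      }
      println(maxMs)
    }
  }
}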

I am working on testing mmap.
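
The mmap test will be along these lines (again just a sketch; the segment name and size are illustrative): pre-allocate the segment, map it read-write, append by writing into the mapped buffer, and flush with MappedByteBuffer.force() (msync) rather than FileChannel.force() (fsync), then repeat the read- and append-latency measurements above.

import java.io.RandomAccessFile
import java.nio.channels.FileChannel

// A mapping has a fixed length, so the segment must be pre-allocated up front.
val segmentSize = 1024 * 1024 * 1024
val raf = new RandomAccessFile("00000000000000000000.log", "rw")
raf.setLength(segmentSize)
val mapped = raf.getChannel.map(FileChannel.MapMode.READ_WRITE, 0, segmentSize)

// Appends write into the mapped region; the buffer position tracks the log end.
def append(bytes: Array[Byte]) { mapped.put(bytes) }

// Flush via msync on the mapping instead of fsync on the channel; the open
// question is whether this blocks concurrent appends and reads the way fsync does.
def flush() { mapped.force() }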
                
> Add a Performance Suite for the Log subsystem
> ---------------------------------------------
>
>                 Key: KAFKA-545
>                 URL: https://issues.apache.org/jira/browse/KAFKA-545
>             Project: Kafka
>          Issue Type: New Feature
>    Affects Versions: 0.8
>            Reporter: Jay Kreps
>            Priority: Blocker
>              Labels: features
>         Attachments: KAFKA-545-draft.patch
>
>
> We have had several performance concerns or potential improvements for the logging subsystem. To evaluate these in a data-driven way, it would be good to have a single-machine performance test that isolates the performance of the log.
> The performance optimizations we would like to evaluate include
> - Special casing appends in a follower which already have the correct offset to avoid decompression and recompression
> - Memory mapping either all or some of the segment files to improve the performance of small appends and lookups
> - Supporting multiple data directories and avoiding RAID
> Having a standalone tool is nice because it isolates the component and makes profiling more intelligible.
> This test would drive load against Log/LogManager controlled by a set of command line options. This command-line program could then be scripted up into a suite of tests that cover variations in message size, message set size, compression, number of partitions, etc.
> Here is a proposed usage for the tool:
> ./bin/kafka-log-perf-test.sh
> Option            Description
> ------            -----------
> --partitions      The number of partitions to write to
> --dir              The directory in which to write the log
> --message-size    The size of the messages
> --set-size         The number of messages per write
> --compression     The compression algorithm to use
> --messages         The number of messages to write
> --readers          The number of reader threads reading the data
> The tool would capture latency and throughput for the append() and read() operations.
