You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by "novosibman (via GitHub)" <gi...@apache.org> on 2023/05/26 19:50:58 UTC

[GitHub] [kafka] novosibman opened a new pull request, #13768: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll

novosibman opened a new pull request, #13768:
URL: https://github.com/apache/kafka/pull/13768

   Related issue https://issues.apache.org/jira/browse/KAFKA-9693
   
   The issue with repeating latency spikes during Kafka log segments rolling still reproduced on the latest versions including  kafka_2.13-3.4.0.
   
   It was found that flushing Kafka snapshot file during segments rolling blocks producer request handling thread for some time:
   https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/log/ProducerStateManager.scala#L452
   ```
     private def writeSnapshot(file: File, entries: mutable.Map[Long, ProducerStateEntry]): Unit = {
   ...
       val fileChannel = FileChannel.open(file.toPath, StandardOpenOption.CREATE, StandardOpenOption.WRITE)
       try {
         fileChannel.write(buffer)
         fileChannel.force(true)     <- here
       } finally {
         fileChannel.close()
       }...
   
   ```
   More partitions - more cumulative latency effect observed.
   
   Suggested fix offloads flush (fileChannel.force) operation to the background thread similar to (but not exactly) how it was done in the UnifiedLog.scala:
   ```
     def roll(
      ...
       // Schedule an asynchronous flush of the old segment
       scheduler.schedule("flush-log", () => flushUptoOffsetExclusive(newSegment.baseOffset))
     }
   ```
   The benchmarking using this fix shows significant reduction in repeating latency spikes:
   
   test config: 
   AWS
   3 node cluster (i3en.2xlarge)
   zulu11.62.17-ca-jdk11.0.18-linux_x64, heap 6G per broker
   1 loadgen (m5n.8xlarge) - OpenMessaging benchmark ([OMB](https://github.com/openmessaging/benchmark)) 
   1  zookeeper (t2.small)
   acks=all batchSize=1048510 consumers=4 insyncReplicas=2 lingerMs=1 mlen=1024 producers=4 rf=3 subscriptions=1 targetRate=200k time=12m topics=1 warmup=1m 
   
   ### variation 1:
   partitions=10 
   <span ng-repeat="item in report" ng-show="!item.visible || item.visible()" ng-class="item.cls + (item.sticky ? 'sticky' : '')" class="ng-scope"><span ng-if="item.table &amp;&amp; item.displayTable.show" ng-class="item.cls" class="ng-scope">
   
   metric | kafka_2.13-3.4.0 | kafka_2.13-3.4.0 patched
   -- | -- | --
   endToEnd service_time (ms) p50 max | 2.00 | 2.00 
   endToEnd service_time (ms) p75 max | 3.00 | 2.00 
   endToEnd service_time (ms) p95 max | 94.0 | 3.00 
   endToEnd service_time (ms) p99 max | 290 | 6.00  
   endToEnd service_time (ms) p99.9 max | 355 | 21.0
   endToEnd service_time (ms) p99.99 max | 372 | 34.0
   endToEnd service_time (ms) p100 max | 374 | 36.0 
   publish service_time (ms) p50 max | 1.70 | 1.67 
   publish service_time (ms) p75 max | 2.23 | 2.09 
   publish service_time (ms) p95 max | 90.7 | 2.82 
   publish service_time (ms) p99 max | 287 | 4.69  
   publish service_time (ms) p99.9 max | 353 | 19.6
   publish service_time (ms) p99.99 max | 369 | 31.3
   publish service_time (ms) p100 max | 371 | 33.5 
   
   kafka | endToEnd chart
   -- | --
   kafka_2.13-3.4.0 | ![image](https://github.com/apache/kafka/assets/6793713/ec329711-47d4-459f-92d7-06310b770023) 
   kafka_2.13-3.4.0 patched | ![image](https://github.com/apache/kafka/assets/6793713/e5aeb6a6-d33a-4d57-be80-c916fa1f05be)
   
   latency score improved up to 10x times in high percentiles ^^^, spikes almost invisible
   
   ### variation 2:
   partitions=100
   metric | kafka_2.13-3.4.0 | kafka_2.13-3.4.0 patched
   -- | -- | --
   endToEnd service_time (ms) p50 max | 91.0 | 2.00 
   endToEnd service_time (ms) p75 max | 358 | 3.00 
   endToEnd service_time (ms) p95 max | 1814 | 4.00
   endToEnd service_time (ms) p99 max | 2777 | 21.0
   endToEnd service_time (ms) p99.9 max | 3643 | 119
   endToEnd service_time (ms) p99.99 max | 3724 | 141
   endToEnd service_time (ms) p100 max | 3726 | 143  
   publish service_time (ms) p50 max | 77.4 | 1.92 
   publish service_time (ms) p75 max | 352 | 2.35  
   publish service_time (ms) p95 max | 1748 | 3.80
   publish service_time (ms) p99 max | 2740 | 18.9
   publish service_time (ms) p99.9 max | 3619 | 116
   publish service_time (ms) p99.99 max | 3720 | 139
   publish service_time (ms) p100 max | 3722 | 141
   endToEnd service_time  
   
   kafka | endToEnd chart
   -- | --
   kafka_2.13-3.4.0 | ![image](https://github.com/apache/kafka/assets/6793713/dc6e9820-c3b7-4bd0-8dac-bce9f3886d91) 
   kafka_2.13-3.4.0 patched | ![image](https://github.com/apache/kafka/assets/6793713/113b4480-97a4-4dd5-8d5c-f7dc87a3d7a5)
   
   latency score improved up to 25x times in high percentiles ^^^
   
   The fix was done for 3.4 branch  - scala version of ProducerStateManager. Trunk needs corresponding fix for ProducerStateManager.java.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] jolshan commented on pull request #13768: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll

Posted by "jolshan (via GitHub)" <gi...@apache.org>.
jolshan commented on PR #13768:
URL: https://github.com/apache/kafka/pull/13768#issuecomment-1565102014

   Thanks for the PR! This looks promising. As Ismael said, let's share in trunk first. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll [kafka]

Posted by "divijvaidya (via GitHub)" <gi...@apache.org>.
divijvaidya closed pull request #13768: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll
URL: https://github.com/apache/kafka/pull/13768


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] zhangsm6 commented on pull request #13768: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll

Posted by "zhangsm6 (via GitHub)" <gi...@apache.org>.
zhangsm6 commented on PR #13768:
URL: https://github.com/apache/kafka/pull/13768#issuecomment-1568759449

   @novosibman What is the file system used in your test? OMB default should be `xfs` but wondering if it was changed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] novosibman commented on pull request #13768: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll

Posted by "novosibman (via GitHub)" <gi...@apache.org>.
novosibman commented on PR #13768:
URL: https://github.com/apache/kafka/pull/13768#issuecomment-1568910054

   > We typically make changes to master first. Would you be willing to submit a PR for that instead?
   
   Prepared and tested trunk version: https://github.com/apache/kafka/pull/13782


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] novosibman commented on pull request #13768: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll

Posted by "novosibman (via GitHub)" <gi...@apache.org>.
novosibman commented on PR #13768:
URL: https://github.com/apache/kafka/pull/13768#issuecomment-1568829724

   > @novosibman What is the file system used in your test? OMB default should be `xfs` but wondering if it was changed
   
   Yes, used `xfs`. This fs type was found in OMB settings. 
   Older experiments showed about 10x worse in high percentiles when using `ext4` :
   ![image](https://github.com/apache/kafka/assets/6793713/442ac661-53c4-4f23-a364-0e4b6057a3a9)
   
   //  the test is quite short not including Kafka log segments rolling


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] github-actions[bot] commented on pull request #13768: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #13768:
URL: https://github.com/apache/kafka/pull/13768#issuecomment-1696716168

   This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has  merge conflicts, please update it with the latest from trunk (or appropriate release branch) <p> If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] ijuma commented on pull request #13768: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll

Posted by "ijuma (via GitHub)" <gi...@apache.org>.
ijuma commented on PR #13768:
URL: https://github.com/apache/kafka/pull/13768#issuecomment-1564903842

   We typically make changes to master first. Would you be willing to submit a PR for that instead?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll [kafka]

Posted by "divijvaidya (via GitHub)" <gi...@apache.org>.
divijvaidya commented on PR #13768:
URL: https://github.com/apache/kafka/pull/13768#issuecomment-1923926345

   This PR is fixed in trunk (scheduled for release in 3.7.0). Currently there are no plans of backporting this to earlier versions since this is a performance optimization and not a critical bug fix. I am closing this PR, please feel free to reopen if you think we still need to backport this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org