You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by "novosibman (via GitHub)" <gi...@apache.org> on 2023/05/30 18:43:45 UTC

[GitHub] [kafka] novosibman opened a new pull request, #13782: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll - trunk version

novosibman opened a new pull request, #13782:
URL: https://github.com/apache/kafka/pull/13782

   Trunk version of initial change: https://github.com/apache/kafka/pull/13768 in branch "3.4"
   
   Key difference with branched change:
   Passed and used existing `scheduler` which already is being used for flushing large segment logs and indices.
   
   In all cases snapshot's fileChannel is kept opened when passed to other threads for flushing and closing (so removing try-with-resource in this change).
   
   Related issue https://issues.apache.org/jira/browse/KAFKA-9693
   
   The issue with repeating latency spikes during Kafka log segments rolling still reproduced on the latest versions including kafka_2.13-3.4.0.
   
   It was found that flushing Kafka snapshot file during segments rolling blocks producer request handling thread for some time. Reproduced latency improvement in the kafka_2.13-3.6.0-snapshot by offloading flush operation. Used available on my side single node test configuration:
    kafka_2.13-3.6.0-snapshot - trunk version
    kafka_2.13-3.6.0-snapshot-fix - trunk version with provided change
   
   partitions=10 # rolling at each ~52 seconds
   ![image](https://github.com/apache/kafka/assets/6793713/6f71a515-36d2-4d10-a577-6a8712c2dbf0)
   
   partitions=100 # rolling events about each 8.5 minute:
   ![image](https://github.com/apache/kafka/assets/6793713/a7780840-75e2-4fca-b1a6-7fa17cec702c)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] novosibman commented on pull request #13782: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll - trunk version

Posted by "novosibman (via GitHub)" <gi...@apache.org>.
novosibman commented on PR #13782:
URL: https://github.com/apache/kafka/pull/13782#issuecomment-1569655316

   > Many thanks for the patch and the collected data! Really interesting to see the impact of this change. A few questions:
   > 
   >     * What storage device and file system are used in the test?
   
   In AWS config used: i3en.2xlarge with 2 x 2500 NVMe SSDs
   In local lab config: 2 x Samsung_SSD_860_EVO_1TB
   FS type: xfs
   
   The FS format had huge impact on results. Initially we used ext4 in our lab for regular testing:
   some of `ext4` example results:
   ![image](https://github.com/apache/kafka/assets/6793713/3fcbec41-9f91-4ee9-9a0c-0732524aad3b)
   after switched to `xfs`:
   ![image](https://github.com/apache/kafka/assets/6793713/1324d042-2664-4737-af48-cd4a723c914d)
   `ext4`  was much worse before and during Kafka logs rolling
   
   > 
   >     * Would you have a real-life workload where the impact of this change can be quantified? The workload generated by the producer-perf-test.sh exhibits the problem the most because the segments of all replicas on the brokers start rolling at the same time. Which is why it is also interesting to assess the impact using topic-partitions which have different ingress rate and/or use segments of different sizes.
   
   We have no any real-life workload scenarios available for Kafka perf testing. Alternative workload https://github.com/AzulSystems/kafka-benchmark has slightly different rolling behavior compared to OMB:
   
   OMB results example on released kafka_2.13-3.4.0 version (using xfs):
   ![image](https://github.com/apache/kafka/assets/6793713/9b8bf37b-7067-44e7-9e18-f28089af0266)
   
   Kafka Tussle benchmark:
   ![image](https://github.com/apache/kafka/assets/6793713/2b3790df-acf5-4990-9736-56a7eb77e7b8)
   
   # same params used:  acks=1 batchSize=1048510 consumers=4 lingerMs=1 mlen=1024 partitions=100 producers=4 rf=1 targetRate=200k time=30m topics=1 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] KAFKA-9693: Kafka latency spikes caused by log segment flush on roll [kafka]

Posted by "ocadaruma (via GitHub)" <gi...@apache.org>.
ocadaruma commented on PR #13782:
URL: https://github.com/apache/kafka/pull/13782#issuecomment-1923713357

   @divijvaidya Yeah, my understanding is the same


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] showuon commented on a diff in pull request #13782: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll - trunk version

Posted by "showuon (via GitHub)" <gi...@apache.org>.
showuon commented on code in PR #13782:
URL: https://github.com/apache/kafka/pull/13782#discussion_r1230335416


##########
clients/src/main/java/org/apache/kafka/common/utils/Utils.java:
##########
@@ -1007,6 +1007,30 @@ public static void flushDir(Path path) throws IOException {
         }
     }
 
+    /**
+     * Flushes dirty file to guarantee crash consistency.
+     *
+     * @throws IOException if flushing the file fails.
+     */
+    public static void flushFile(Path path) throws IOException {
+        if (path != null) {
+            try (FileChannel fileChannel = FileChannel.open(path, StandardOpenOption.READ)) {
+                fileChannel.force(true);
+            }
+        }
+    }
+
+    /**
+     * Flushes dirty file quietly, logs warning when exception happens.
+     */
+    public static void flushFileQuietly(Path path, String name) {
+        try {
+            flushFile(path);
+        } catch (IOException e) {
+            log.warn("Failed to flush {} at path {}", name, path);

Review Comment:
   You should still put `e` in the 3rd parameter, like this:
   `log.warn("Failed to flush {} at path {}", name, path, e);`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] novosibman commented on a diff in pull request #13782: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll - trunk version

Posted by "novosibman (via GitHub)" <gi...@apache.org>.
novosibman commented on code in PR #13782:
URL: https://github.com/apache/kafka/pull/13782#discussion_r1223374923


##########
storage/src/main/java/org/apache/kafka/storage/internals/log/ProducerStateManager.java:
##########
@@ -681,7 +687,12 @@ private static void writeSnapshot(File file, Map<Long, ProducerStateEntry> entri
 
         try (FileChannel fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
             fileChannel.write(buffer);
-            fileChannel.force(true);
+        }
+
+        if (scheduler != null) {
+            scheduler.scheduleOnce("flush-producer-snapshot", () -> Utils.flushFileQuietly(file.toPath(), "producer-snapshot"));
+        } else {
+            Utils.flushFileQuietly(file.toPath(), "producer-snapshot");

Review Comment:
   Open/close changes done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] KAFKA-9693: Kafka latency spikes caused by log segment flush on roll [kafka]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #13782:
URL: https://github.com/apache/kafka/pull/13782#issuecomment-1827074604

   This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has  merge conflicts, please update it with the latest from trunk (or appropriate release branch) <p> If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] novosibman commented on a diff in pull request #13782: KAFKA-9693: Kafka latency spikes caused by log segment flush on roll

Posted by "novosibman (via GitHub)" <gi...@apache.org>.
novosibman commented on code in PR #13782:
URL: https://github.com/apache/kafka/pull/13782#discussion_r1230800451


##########
clients/src/main/java/org/apache/kafka/common/utils/Utils.java:
##########
@@ -1007,6 +1007,30 @@ public static void flushDir(Path path) throws IOException {
         }
     }
 
+    /**
+     * Flushes dirty file to guarantee crash consistency.
+     *
+     * @throws IOException if flushing the file fails.
+     */
+    public static void flushFile(Path path) throws IOException {
+        if (path != null) {
+            try (FileChannel fileChannel = FileChannel.open(path, StandardOpenOption.READ)) {
+                fileChannel.force(true);
+            }
+        }
+    }
+
+    /**
+     * Flushes dirty file quietly, logs warning when exception happens.
+     */
+    public static void flushFileQuietly(Path path, String name) {
+        try {
+            flushFile(path);
+        } catch (IOException e) {
+            log.warn("Failed to flush {} at path {}", name, path);

Review Comment:
   Corrected.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] Hangleton commented on pull request #13782: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll - trunk version

Posted by "Hangleton (via GitHub)" <gi...@apache.org>.
Hangleton commented on PR #13782:
URL: https://github.com/apache/kafka/pull/13782#issuecomment-1568955091

   Many thanks for the patch and the collected data! Really interesting to see the impact of this change. A few questions: 
   
   - What storage device and file system are used in the test?
   - Would you have a real-life workload where the impact of this change can be quantified? The workload generated by the producer-perf-test.sh exhibits the problem the most because the segments of all replicas on the brokers start rolling at the same time. Which is why it is also interesting to assess the impact using topic-partitions which have different ingress rate and/or use segments of different sizes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] novosibman commented on pull request #13782: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll - trunk version

Posted by "novosibman (via GitHub)" <gi...@apache.org>.
novosibman commented on PR #13782:
URL: https://github.com/apache/kafka/pull/13782#issuecomment-1572027303

   > Are all the graphs shared for OMB and Kafka Tussle generated for Kafka with the fix in this PR?
   Graphs with the fix noted in first description comment  - marked with `kafka_2.13-3.6.0-snapshot-fix` label.
   
   Other graphs in latter comment are examples of how rolling affects results on different configurations and benchmarks using regular Kafka release.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] divijvaidya commented on a diff in pull request #13782: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll - trunk version

Posted by "divijvaidya (via GitHub)" <gi...@apache.org>.
divijvaidya commented on code in PR #13782:
URL: https://github.com/apache/kafka/pull/13782#discussion_r1211516603


##########
storage/src/main/java/org/apache/kafka/storage/internals/log/ProducerStateManager.java:
##########
@@ -679,9 +688,21 @@ private static void writeSnapshot(File file, Map<Long, ProducerStateEntry> entri
         long crc = Crc32C.compute(buffer, PRODUCER_ENTRIES_OFFSET, buffer.limit() - PRODUCER_ENTRIES_OFFSET);
         ByteUtils.writeUnsignedInt(buffer, CRC_OFFSET, crc);
 
-        try (FileChannel fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
-            fileChannel.write(buffer);
+        FileChannel fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.CREATE, StandardOpenOption.WRITE);
+        fileChannel.write(buffer);
+        if (scheduler != null) {
+            scheduler.scheduleOnce("flush-snapshot", () -> closeSnapshotFile(fileChannel));
+        } else {
+            closeSnapshotFile(fileChannel);
+        }
+    }
+
+    private static void closeSnapshotFile(FileChannel fileChannel) {
+        try {
             fileChannel.force(true);
+            fileChannel.close();

Review Comment:
   Let's decouple the exception handling of close from the above statement. We want to try releasing resources even if the flush threw an error.
   
   Also, you can use `Utils.closeQuietly` to close the fileChannel. It will not propagate the exception but instead just log a message at `warn` level. This will ensure that the thread is not terminated anywhere else after the exception.



##########
storage/src/main/java/org/apache/kafka/storage/internals/log/ProducerStateManager.java:
##########
@@ -679,9 +688,21 @@ private static void writeSnapshot(File file, Map<Long, ProducerStateEntry> entri
         long crc = Crc32C.compute(buffer, PRODUCER_ENTRIES_OFFSET, buffer.limit() - PRODUCER_ENTRIES_OFFSET);
         ByteUtils.writeUnsignedInt(buffer, CRC_OFFSET, crc);
 
-        try (FileChannel fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
-            fileChannel.write(buffer);
+        FileChannel fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.CREATE, StandardOpenOption.WRITE);
+        fileChannel.write(buffer);
+        if (scheduler != null) {
+            scheduler.scheduleOnce("flush-snapshot", () -> closeSnapshotFile(fileChannel));

Review Comment:
   suggestion: flush-producer-snapshot



##########
core/src/main/scala/kafka/log/UnifiedLog.scala:
##########
@@ -1498,7 +1498,7 @@ class UnifiedLog(@volatile var logStartOffset: Long,
     // may actually be ahead of the current producer state end offset (which corresponds to the log end offset),
     // we manually override the state offset here prior to taking the snapshot.
     producerStateManager.updateMapEndOffset(newSegment.baseOffset)
-    producerStateManager.takeSnapshot()
+    producerStateManager.takeSnapshot(scheduler)

Review Comment:
   I am wondering if we need to increase the default size of background threads since we are adding more responsibility to it. Thoughts?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] novosibman commented on pull request #13782: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll - trunk version

Posted by "novosibman (via GitHub)" <gi...@apache.org>.
novosibman commented on PR #13782:
URL: https://github.com/apache/kafka/pull/13782#issuecomment-1572620634

   Provided updated change:
   returned original try-with-resource on writing, added utility method for flushing:
   ```
           try (FileChannel fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
               fileChannel.write(buffer);
           }
           if (scheduler != null) {
               scheduler.scheduleOnce("flush-producer-snapshot", () -> Utils.flushFileQuietly(file.toPath(), "producer-snapshot"));
           } else {
               Utils.flushFileQuietly(file.toPath(), "producer-snapshot");
           }
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] showuon commented on pull request #13782: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll - trunk version

Posted by "showuon (via GitHub)" <gi...@apache.org>.
showuon commented on PR #13782:
URL: https://github.com/apache/kafka/pull/13782#issuecomment-1578040307

   Also, there are compiling error, please fix it. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] divijvaidya commented on pull request #13782: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll - trunk version

Posted by "divijvaidya (via GitHub)" <gi...@apache.org>.
divijvaidya commented on PR #13782:
URL: https://github.com/apache/kafka/pull/13782#issuecomment-1591281777

   Hey @novosibman could you please respond to rest of the comments at https://github.com/apache/kafka/pull/13782#pullrequestreview-1461194326 and https://github.com/apache/kafka/pull/13782#discussion_r1216711941


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] divijvaidya commented on pull request #13782: KAFKA-9693: Kafka latency spikes caused by log segment flush on roll

Posted by "divijvaidya (via GitHub)" <gi...@apache.org>.
divijvaidya commented on PR #13782:
URL: https://github.com/apache/kafka/pull/13782#issuecomment-1592865706

   @showuon - with this change we don't have consistent data on different flushed files on disk (since earlier they were flushed together but now it's done async). I want to ensure that this inconsistent is ok and recovery will not get hampered by it. Please wait for my review before merging this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] novosibman commented on a diff in pull request #13782: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll - trunk version

Posted by "novosibman (via GitHub)" <gi...@apache.org>.
novosibman commented on code in PR #13782:
URL: https://github.com/apache/kafka/pull/13782#discussion_r1223373702


##########
storage/src/main/java/org/apache/kafka/storage/internals/log/ProducerStateManager.java:
##########
@@ -430,11 +428,19 @@ public Optional<ProducerStateEntry> lastEntry(long producerId) {
      * Take a snapshot at the current end offset if one does not already exist.
      */
     public void takeSnapshot() throws IOException {
+        takeSnapshot(null);
+    }
+
+    /**
+     * Take a snapshot at the current end offset if one does not already exist.
+     * Flush the snapshot asynchronously if scheduler != null
+     */
+    public void takeSnapshot(Scheduler scheduler) throws IOException {

Review Comment:
   IOException still will be thrown on open/write/close operations. Force (flush) operation running by scheduler in a separate thread will write log warning only.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] novosibman commented on pull request #13782: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll - trunk version

Posted by "novosibman (via GitHub)" <gi...@apache.org>.
novosibman commented on PR #13782:
URL: https://github.com/apache/kafka/pull/13782#issuecomment-1583094847

   Open/close changes provided. 
   Also corrected style check issue (in task ':storage:checkstyleMain').


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] showuon commented on a diff in pull request #13782: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll - trunk version

Posted by "showuon (via GitHub)" <gi...@apache.org>.
showuon commented on code in PR #13782:
URL: https://github.com/apache/kafka/pull/13782#discussion_r1219035983


##########
storage/src/main/java/org/apache/kafka/storage/internals/log/ProducerStateManager.java:
##########
@@ -681,7 +687,12 @@ private static void writeSnapshot(File file, Map<Long, ProducerStateEntry> entri
 
         try (FileChannel fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
             fileChannel.write(buffer);
-            fileChannel.force(true);
+        }
+
+        if (scheduler != null) {
+            scheduler.scheduleOnce("flush-producer-snapshot", () -> Utils.flushFileQuietly(file.toPath(), "producer-snapshot"));
+        } else {
+            Utils.flushFileQuietly(file.toPath(), "producer-snapshot");

Review Comment:
   After this change, the fileChannel needs to open/close twice under `scheduler == null` case. Maybe we can jsut do it like this:
   ```
   try (FileChannel fileChannel = FileChannel.open(file.toPath(), StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
               fileChannel.write(buffer);
              if (scheduler == null) {
                   // directly flush to disk
                   fileChannel.force(true);
              }
           }
   
           if (scheduler != null) {
               scheduler.scheduleOnce("flush-producer-snapshot", () -> Utils.flushFileQuietly(file.toPath(), "producer-snapshot"));
           }
   ```
   
   WDYT?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] KAFKA-9693: Kafka latency spikes caused by log segment flush on roll [kafka]

Posted by "divijvaidya (via GitHub)" <gi...@apache.org>.
divijvaidya closed pull request #13782: KAFKA-9693: Kafka latency spikes caused by log segment flush on roll
URL: https://github.com/apache/kafka/pull/13782


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] Hangleton commented on pull request #13782: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll - trunk version

Posted by "Hangleton (via GitHub)" <gi...@apache.org>.
Hangleton commented on PR #13782:
URL: https://github.com/apache/kafka/pull/13782#issuecomment-1571604264

   @novosibman 
   
   Thanks for the reply. In the tests we conducted in [KAFKA-9693](https://issues.apache.org/jira/browse/KAFKA-9693), nvme SSDs and ext4 were used along with jbd2, which likely penalized performance.
   
   Are all the graphs shared for OMB and Kafka Tussle generated for Kafka with the fix in this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] novosibman commented on a diff in pull request #13782: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll - trunk version

Posted by "novosibman (via GitHub)" <gi...@apache.org>.
novosibman commented on code in PR #13782:
URL: https://github.com/apache/kafka/pull/13782#discussion_r1230002264


##########
clients/src/main/java/org/apache/kafka/common/utils/Utils.java:
##########
@@ -976,6 +976,30 @@ public static void flushDir(Path path) throws IOException {
         }
     }
 
+    /**
+     * Flushes dirty file to guarantee crash consistency.
+     *
+     * @throws IOException if flushing the file fails.
+     */
+    public static void flushFile(Path path) throws IOException {
+        if (path != null) {
+            try (FileChannel fileChannel = FileChannel.open(path, StandardOpenOption.READ)) {
+                fileChannel.force(true);
+            }
+        }
+    }
+
+    /**
+     * Flushes dirty file quietly, logs warning when exception happens.
+     */
+    public static void flushFileQuietly(Path path, String name) {
+        try {
+            flushFile(path);
+        } catch (IOException e) {
+            log.warn("Failed to flush {} at path {} with exception {}", name, path, e);

Review Comment:
   Third parameter removed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] divijvaidya commented on a diff in pull request #13782: Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll - trunk version

Posted by "divijvaidya (via GitHub)" <gi...@apache.org>.
divijvaidya commented on code in PR #13782:
URL: https://github.com/apache/kafka/pull/13782#discussion_r1216711941


##########
clients/src/main/java/org/apache/kafka/common/utils/Utils.java:
##########
@@ -976,6 +976,30 @@ public static void flushDir(Path path) throws IOException {
         }
     }
 
+    /**
+     * Flushes dirty file to guarantee crash consistency.
+     *
+     * @throws IOException if flushing the file fails.
+     */
+    public static void flushFile(Path path) throws IOException {
+        if (path != null) {
+            try (FileChannel fileChannel = FileChannel.open(path, StandardOpenOption.READ)) {
+                fileChannel.force(true);
+            }
+        }
+    }
+
+    /**
+     * Flushes dirty file quietly, logs warning when exception happens.
+     */
+    public static void flushFileQuietly(Path path, String name) {
+        try {
+            flushFile(path);
+        } catch (IOException e) {
+            log.warn("Failed to flush {} at path {} with exception {}", name, path, e);

Review Comment:
   you can skip the third `{}` and change it to `Failed to flush {} at path {}`: https://www.slf4j.org/faq.html#paramException



##########
storage/src/main/java/org/apache/kafka/storage/internals/log/ProducerStateManager.java:
##########
@@ -430,11 +428,19 @@ public Optional<ProducerStateEntry> lastEntry(long producerId) {
      * Take a snapshot at the current end offset if one does not already exist.
      */
     public void takeSnapshot() throws IOException {
+        takeSnapshot(null);
+    }
+
+    /**
+     * Take a snapshot at the current end offset if one does not already exist.
+     * Flush the snapshot asynchronously if scheduler != null
+     */
+    public void takeSnapshot(Scheduler scheduler) throws IOException {

Review Comment:
   does this still throw IOException? Asking because I think we are eating up the exception in flushFileQuietly



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] KAFKA-9693: Kafka latency spikes caused by log segment flush on roll [kafka]

Posted by "divijvaidya (via GitHub)" <gi...@apache.org>.
divijvaidya commented on PR #13782:
URL: https://github.com/apache/kafka/pull/13782#issuecomment-1923677557

   @novosibman @ocadaruma do we still need this change after https://github.com/apache/kafka/pull/14242/files? Asking because with the latter PR merged in, we are  not blocking request handler thread while flushing producer snapshot. This is same was what this PR is trying to achieve. Hence, I think this could be closed. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] KAFKA-9693: Kafka latency spikes caused by log segment flush on roll [kafka]

Posted by "divijvaidya (via GitHub)" <gi...@apache.org>.
divijvaidya commented on PR #13782:
URL: https://github.com/apache/kafka/pull/13782#issuecomment-1923906249

   Thanks for checking @ocadaruma . I am going to close this PR, please feel free to re-open if you think that this is still not fixed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org