Posted to dev@kafka.apache.org by "José Armando García Sancio (Jira)" <ji...@apache.org> on 2023/08/07 17:07:00 UTC

[jira] [Created] (KAFKA-15312) FileRawSnapshotWriter must flush before atomic move

José Armando García Sancio created KAFKA-15312:
--------------------------------------------------

             Summary: FileRawSnapshotWriter must flush before atomic move
                 Key: KAFKA-15312
                 URL: https://issues.apache.org/jira/browse/KAFKA-15312
             Project: Kafka
          Issue Type: Bug
          Components: kraft
            Reporter: José Armando García Sancio
            Assignee: José Armando García Sancio
             Fix For: 3.6.0


Not all file systems fsync to disk on close. For KRaft to guarantee that the data has made it to disk before calling rename, it needs to make sure that the file has been fsynced.

We have seen cases where the snapshot file has zero-length data on the ext4 file system.
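A minimal sketch of the intended fix, assuming the writer keeps an open FileChannel for the temporary snapshot file; the class and method names below are illustrative and are not the actual FileRawSnapshotWriter API:
{code:java}
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class SnapshotFreezeSketch {
    /**
     * Illustrative only: flush the temporary snapshot file to disk before
     * atomically renaming it to its final name. Without the force() call,
     * delayed allocation on ext4 can leave a zero-length file if the system
     * crashes after the rename but before the data blocks are written.
     */
    public static void freeze(FileChannel channel, Path tmpPath, Path destPath) throws IOException {
        channel.force(true);   // fsync data and metadata of the temporary file
        channel.close();
        Files.move(tmpPath, destPath, StandardCopyOption.ATOMIC_MOVE);
    }
}
{code}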
{quote} "Delayed allocation" means that the filesystem tries to delay the allocation of physical disk blocks for written data for as long as possible. This policy brings some important performance benefits. Many files are short-lived; delayed allocation can keep the system from writing fleeting temporary files to disk at all. And, for longer-lived files, delayed allocation allows the kernel to accumulate more data and to allocate the blocks for data contiguously, speeding up both the write and any subsequent reads of that data. It's an important optimization which is found in most contemporary filesystems.

But, if blocks have not been allocated for a file, there is no need to write them quickly as a security measure. Since the blocks do not yet exist, it is not possible to read somebody else's data from them. So ext4 will not (cannot) write out unallocated blocks as part of the next journal commit cycle. Those blocks will, instead, wait until the kernel decides to flush them out; at that point, physical blocks will be allocated on disk and the data will be made persistent. The kernel doesn't like to let file data sit unwritten for too long, but it can still take a minute or so (with the default settings) for that data to be flushed - far longer than the five seconds normally seen with ext3. And that is why a crash can cause the loss of quite a bit more data when ext4 is being used. 
{quote}
from: [https://lwn.net/Articles/322823/]
{quote}auto_da_alloc(*), noauto_da_alloc

Many broken applications don't use fsync() when replacing existing files via patterns such as fd = open("foo.new")/write(fd,..)/close(fd)/ rename("foo.new", "foo"), or worse yet, fd = open("foo", O_TRUNC)/write(fd,..)/close(fd). If auto_da_alloc is enabled, ext4 will detect the replace-via-rename and replace-via-truncate patterns and force that any delayed allocation blocks are allocated such that at the next journal commit, in the default data=ordered mode, the data blocks of the new file are forced to disk before the rename() operation is committed. This provides roughly the same level of guarantees as ext3, and avoids the "zero-length" problem that can happen when a system crashes before the delayed allocation blocks are forced to disk.
{quote}
from: [https://www.kernel.org/doc/html/latest/admin-guide/ext4.html]
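For the replace-via-rename pattern described above, durability also depends on the rename itself being persisted. A hedged sketch of flushing the parent directory after the atomic move; the helper below is illustrative, assumes a POSIX file system, and is not the actual KRaft code:
{code:java}
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public final class DirectoryFlushSketch {
    /**
     * Illustrative only: after rename("foo.part", "foo"), fsync the parent
     * directory so the new directory entry is durable as well. Opening a
     * directory for read and calling force() works on POSIX file systems.
     */
    public static void flushParentDirectory(Path renamedFile) throws IOException {
        Path parent = renamedFile.toAbsolutePath().getParent();
        try (FileChannel dir = FileChannel.open(parent, StandardOpenOption.READ)) {
            dir.force(true);   // fsync the directory entry for the renamed file
        }
    }
}
{code}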

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)