Posted to dev@subversion.apache.org by Stefan Fuhrmann <st...@wandisco.com> on 2015/06/16 21:57:41 UTC

Experiments with FlushFileBuffers on Windows

Hey there,

One of the links recently provided by Daniel Klima pointed
to a way to enable write caching even on USB devices.
So, I could use my Windows installation for experiments now
without the risk of bricking 2 grand worth of disks by pulling
the plug tens of times.

-- Stefan^2.


TL;DR
=====
FlushFileBuffers operates on whole files, not just the parts
written through the respective handle. Not calling it after rename
results in potential data loss. Calling it after rename eliminates
the problem at least in most cases.

Setup:
=====
I used the attached program to conduct 3 different experiments,
each trying a specific modification / fsync sequence. All would
write to a USB stick which had the OS write cache enabled
in Windows 7.

All tests run an unlimited number of iterations - until there is an
I/O error (e.g. caused by disconnecting the drive). Each
iteration writes a separate file with distinct contents
("run number xyz", repeated many times), so we can determine
which files' contents are complete and correct and whether all
files are present. Each successful iteration is logged to the console.
We expect the data for all these to be complete.

The stick was yanked out at a random point in time and reconnected
after about a minute; chkdsk /f was run on it, and the program
output was then compared with the USB stick's contents.
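
For reference, the skeleton of one test run looks roughly like this
(a minimal sketch of the attached program; the identifiers and the
drive path are mine, not the real program's):

    /* Sketch of the test loop: write one file per iteration until
       the device disappears and an I/O call fails. */
    #include <windows.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        unsigned run;
        for (run = 0; ; ++run)
        {
            char name[MAX_PATH], line[64];
            HANDLE file;
            DWORD written;
            BOOL ok = TRUE;
            int i;

            sprintf(name, "x:\\test\\%u.txt", run);   /* separate file per run */
            sprintf(line, "run number %u\r\n", run);  /* distinct contents */

            file = CreateFileA(name, GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
            if (file == INVALID_HANDLE_VALUE)
                break;                                /* drive is gone */

            for (i = 0; ok && i < 1000; ++i)          /* repeated many times */
                ok = WriteFile(file, line, (DWORD)strlen(line),
                               &written, NULL);
            CloseHandle(file);
            if (!ok)
                break;

            printf("run %u complete\n", run);         /* log success */
        }
        return 0;
    }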

Experiment 1: fsync a file written through a different handle.
==============================================
Write the same contents to two files, alternating between them
100x. Both files end up the same size (>1MB) and should be
similarly "important" to the OS. Close both files. Re-open the
one written last and fsync it.
This re-open scenario is similar to what we do with the protorev
file.
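
In code, one iteration of this experiment looks roughly like the
following sketch (error handling omitted; identifiers are mine):

    /* Sketch: alternate writes to two files, then fsync only the
       one written last, through a freshly opened handle. */
    #include <windows.h>

    static void one_iteration(const char *nameA, const char *nameB,
                              const char *chunk, DWORD size)
    {
        HANDLE a = CreateFileA(nameA, GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        HANDLE b = CreateFileA(nameB, GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        DWORD written;
        int i;

        for (i = 0; i < 100; ++i)        /* same contents, alternating */
            WriteFile(i % 2 ? b : a, chunk, size, &written, NULL);

        CloseHandle(a);
        CloseHandle(b);

        /* Re-open the file written last and fsync it through the new
           handle - the same pattern we use for the protorev file. */
        b = CreateFileA(nameB, GENERIC_WRITE, 0, NULL,
                        OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        FlushFileBuffers(b);
        CloseHandle(b);
    }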

Results:
* 10 runs were made, between 17 and 84 iterations each.
* 10x, the fsync'ed file and its contents were complete.
* 10x, the non-synced files were present and showed the
  correct file size, but the contents of the last few of them
  were NUL bytes.

Interpretation:
Re-opening a file and fsync'ing it flushes *all* content changes
for that file - at least on Windows. The way we handle the
protorev file is correct.

Experiment 2: fsync before but not after rename
=======================================
This mimics the core of our "move-in-place" logic: Write a
small-ish file (here: 10 .. 20k to not get folded into the MFT)
with some temporary name, fsync and close it. Then rename it to
its final name in the same folder.
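
As a sketch (again with error handling omitted and my own
identifiers):

    /* Sketch: fsync the temp file, rename it, but do NOT fsync
       after the rename. */
    #include <windows.h>

    static void one_iteration(const char *tmpName, const char *finalName,
                              const char *data, DWORD size)
    {
        DWORD written;
        HANDLE file = CreateFileA(tmpName, GENERIC_WRITE, 0, NULL,
                                  CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);

        WriteFile(file, data, size, &written, NULL);  /* 10..20k payload */
        FlushFileBuffers(file);                       /* fsync before... */
        CloseHandle(file);

        /* ...but nothing after the move into place. */
        MoveFileExA(tmpName, finalName, MOVEFILE_REPLACE_EXISTING);
    }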

Results:
* 5 runs were made, between 182 and 435 iterations each.
* 1x the final file existed with the correct contents
* 3x the .temp file existed for the last completed iteration.
* 1x even the final file for the previous iteration contained
  NULs. After that run, chkdsk reported and fixed a large
  number of issues.

Interpretation:
Not fsync'ing after rename will lead to data loss even with
NTFS. IOW, we don't have transactional guarantees for
"commit" on Windows servers at the moment.

The last case with the more severe corruption may be due
to the storage device not handling its buffers correctly.
The only thing we can do here is tell people to use battery-
backed storage.

Experiment 3: fsync before *and* after rename
=======================================
Same as above but re-open the file after rename and fsync it.
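
The sketch from experiment 2, extended by the extra flush:

    /* Sketch: same as experiment 2, plus a re-open and fsync under
       the final name after the rename. */
    #include <windows.h>

    static void one_iteration(const char *tmpName, const char *finalName,
                              const char *data, DWORD size)
    {
        DWORD written;
        HANDLE file = CreateFileA(tmpName, GENERIC_WRITE, 0, NULL,
                                  CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);

        WriteFile(file, data, size, &written, NULL);
        FlushFileBuffers(file);                       /* fsync before... */
        CloseHandle(file);

        MoveFileExA(tmpName, finalName, MOVEFILE_REPLACE_EXISTING);

        /* ...*and* after the rename, through a fresh handle. */
        file = CreateFileA(finalName, GENERIC_WRITE, 0, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        FlushFileBuffers(file);
        CloseHandle(file);
    }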

Results:
* 10 runs were made, between 127 and 1984 iterations each.
* 7x the final file existed with the correct contents
* 1x the next temp already existed with size 0
  (this is also a correct state; the last complete iteration's
   final file existed with the correct contents)
* 1x the next temp already existed with correct contents
  (correct, same as before)
* 1x the last final file was missing, there was no temp file
  and the previous final file contained invalid data. After
  that run, there were various issues fixed by chkdsk.
  It was also the run with the most iterations.

Interpretation:
In 90% of the runs, fsync'ing after rename resulted in
correct disk contents. This is much better than the results
in Experiment 2. The remainder may be due to limitations
of the storage device and has been observed in Exp. 2
as well.

Re: Experiments with FlushFileBuffers on Windows

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Tue, Jun 23, 2015 at 5:09 PM, Ivan Zhakov <iv...@visualsvn.com> wrote:

> On 16 June 2015 at 22:57, Stefan Fuhrmann <st...@wandisco.com>
> wrote:
> > [...]
>
> I've tried to repeat your tests, but failed to do so:
> 1. Your attached program is missing some scripts needed to perform real tests.
>

That source should build as a plain MBCS Win32 console application.
For your convenience, I now attached the full VS solution.


> 2. I don't have the same USB stick that you used in your tests :)
>

Well, any device that can be suddenly removed should do the trick
(USB, eSATA, something networked, ...). Even an internal disk will
do if you are willing to pull the plug. A VM might work as well if
it is killed without informing the OS beforehand.

> Also, I don't think that NTFS on a removable USB flash drive can be
> used to simulate a power-loss scenario on Windows: removable disks
> are not available during system boot, so Windows cannot replay the
> NTFS journal during startup.
>

Windows can replay the journal upon mounting the volume - just
like any other volume that was not active during system startup.
To make really sure that issues get fixed, I ran chkdsk on the
volume before examining it. Are you suggesting that only the
volumes present at boot time get additional checking?

Also, a journal can only replay what has been written to it. In that
respect it is no different from any other data on disk. For a rename
to be permanent, it has to be recorded on disk "somehow"
"somewhere" and that requires physical I/O. Unless rename
is very slow on spinning disks, no such I/O is happening.

The only half-way option that the OS has is to write the journal
entry directly into the disk cache (without flushing it). That's still
fast while being much safer than the OS cache. Virtualized disks
then behave like battery-backed disks and will not show data loss.


> Instead, I tweaked 'repos-test 25' to emulate concurrent
> commits of 10kb files in 4 parallel threads (see attached very dirty
> patch).
>

That looks o.k. and should be able to reproduce issues. There are
two downsides compared to my simpler example code:

* The fsync after rename is only one of multiple fsync ops during
  a commit. Your chances of hitting it are 10..20% vs. 50 to ~100%
  in my setup. So, you may need 10s of runs for good confidence.

* fsyncs in other threads might trigger metadata and journal flushes
  that effectively act like an fsync after the previous rename.
  Without further analysis of what Windows will sync and when,
  one could expect to hit a critical situation in even fewer cases.

IOW, that setup is complex enough to expose all sorts of problems
if they exist. But it may greatly reduce the incident rate for the one
problem that we are investigating.


> Then I've performed several tests on Windows Server 2012 R2 running
> under VMware Workstation 9, forcing power off after 300-400 commits.
> I've performed 10 tests and never got repository corruption, even
> when I removed the FlushFileBuffers() call *after* rename.


I assume you only ran 'svnadmin verify' or something to that
effect. Did you then verify that the last reported HEAD revision
was not lost?

Given my commentary above, 10 runs may not be enough,
while there is no need to wait for 300..400 commits (in case
that takes considerable time in your test environment).

However, assuming that you actually compared expected
HEAD vs. reported HEAD, your test demonstrates that the
incident rate is low - far lower than what you would see with
no fsync at all.


> During the restart the
> OS may report that it recovered volume data, but after that the
> repository data remains in a consistent state. Removing the other
> FlushFileBuffers() calls results in repository corruption after two
> runs.
>

That demonstrates that fsync is at least a meaningful operation
in your test setup and that resetting the VM can make you lose
at least some data (despite virtualized HW and specific drivers
that might change guest-OS-side buffering).


> While I agree that passing MOVEFILE_COPY_ALLOWED to MoveFileEx() is a
> bug, calling FlushFileBuffers() is not necessary, at least in the
> case of NTFS on a permanently connected disk. I suppose that is
> because MoveFileEx() is already journaled, which means that the
> journal is flushed to disk before the operation completes. But we may
> add the MOVEFILE_WRITE_THROUGH flag to make sure that this operation
> will be synced on other filesystems or network shares, though this
> requires more Windows-specific code.
>

Yes, that is a useful improvement.
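
For illustration, the rename could then look roughly like this (a
sketch only, not the actual svn_io code):

    /* Sketch of the suggested change: let the rename itself write
       through, instead of a separate FlushFileBuffers afterwards.
       MOVEFILE_COPY_ALLOWED is deliberately not passed. */
    #include <windows.h>

    static BOOL move_into_place(const char *tmpName,
                                const char *finalName)
    {
        /* With MOVEFILE_WRITE_THROUGH, MoveFileEx does not return
           until the file has actually been moved on the disk. */
        return MoveFileExA(tmpName, finalName,
                           MOVEFILE_REPLACE_EXISTING
                           | MOVEFILE_WRITE_THROUGH);
    }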

In the meantime, I'm halfway through eliminating the need for
the final move-into-place of 'current' to be persistent in FSX.
For revprops and rev data, that is no longer a problem because
they get written / completed and fsynced in their final
location.

-- Stefan^2.

Re: Experiments with FlushFileBuffers on Windows

Posted by Ivan Zhakov <iv...@visualsvn.com>.
On 16 June 2015 at 22:57, Stefan Fuhrmann <st...@wandisco.com> wrote:
> [...]

I've tried to repeat your tests, but failed to do so:
1. Your attached program is missing some scripts needed to perform real tests.
2. I don't have the same USB stick that you used in your tests :)

Also, I don't think that NTFS on a removable USB flash drive can be
used to simulate a power-loss scenario on Windows: removable disks
are not available during system boot, so Windows cannot replay the
NTFS journal during startup.

Instead, I tweaked 'repos-test 25' to emulate concurrent
commits of 10kb files in 4 parallel threads (see attached very dirty
patch).

Then I've performed several tests on Windows Server 2012 R2 running
under VMware Workstation 9, forcing power off after 300-400 commits.
I've performed 10 tests and never got repository corruption, even
when I removed the FlushFileBuffers() call *after* rename. During the
restart the OS may report that it recovered volume data, but after
that the repository data remains in a consistent state. Removing the
other FlushFileBuffers() calls results in repository corruption after
two runs.

While I agree that passing MOVEFILE_COPY_ALLOWED to MoveFileEx() is a
bug, calling FlushFileBuffers() is not necessary, at least in the
case of NTFS on a permanently connected disk. I suppose that is
because MoveFileEx() is already journaled, which means that the
journal is flushed to disk before the operation completes. But we may
add the MOVEFILE_WRITE_THROUGH flag to make sure that this operation
will be synced on other filesystems or network shares, though this
requires more Windows-specific code.

---
Ivan Zhakov