Posted to dev@subversion.apache.org by Stefan Fuhrmann <st...@wandisco.com> on 2015/05/28 19:47:12 UTC

Efficient and effective fsync during commit

Hi all,

Most of us would agree that the way we fsync FS changes
in FSFS and FSX is slow (~10 commits / sec on an SSD,
YMMV) and that not even all changes are fully fsync'ed
(repo creation, upgrade).

From a high-level perspective, a commit is a simple
3-step process:

1. Write rev contents & props to their final location.
   They are not accessible until 'current' gets bumped.
   Write the new 'current' contents to a temp file.
2. Fsync everything we wrote in step 1 to disk.
   Still not visible to any other FS API user.
3. Atomically switch to the new 'current' contents and
   fsync that change.

Today, we fsync "locally" as part of whatever copy or
move operation we need. That is inefficient: on POSIX
platforms, we tend to fsync the same folder more than
once, and on Windows we need to sync each file twice
(content before the rename, metadata after it). Basically,
lacking the context info, we have to play it safe and do
much more work than is actually necessary.

In the future, we should implement step 1 as plain
non-fsync'ing file operations. Then, in step 2, explicitly
sync every file, and on POSIX the folders, exactly once;
step 2 has no atomicity requirements. Finally, do the
'current' rename. That, too, requires only a single fsync
b/c the temp file will be in the same folder.

On top of that, all operations in step 2 can be run
concurrently. I did that for FSX on Linux using aio_fsync
and it got 3x as fast. Windows can do something similar.
I wrapped that functionality into a "batch_fsync" object
with a few methods on it. You simply push paths into it,
it drops duplicates, and finally you ask it to fsync all.

To get even faster, an FS instance can piggy-back the
allocation of a new txn ID onto this process. In normal
operation, the svn_fs_t will be cleaned up and the
txn ID wasted. In 'svnadmin load', however, we
save the double fsync dance with the 'txn-current' file.

So, I suggest implementing svn_io__batch_fsync_t
and using it for all durable FS modifications, explicitly
excluding in-txn operations.

Does that sound like a plan?

-- Stefan^2.

Re: Efficient and effective fsync during commit

Posted by Daniel Klíma <da...@gmail.com>.
2015-05-29 20:55 GMT+02:00 Branko Čibej <br...@wandisco.com>:

> On 29.05.2015 18:23, Ivan Zhakov wrote:
> > On 29 May 2015 at 18:55, Stefan Fuhrmann <st...@wandisco.com>
> wrote:
> >> You might be right. So, if you care about repository
> >> integrity, you should use your MSDN subscription and
> >> ask MS for clarification on FlushFileBuffers() behaviour.
> > You also may request MSDN subscription and ask MS for clarification or
> > keep Windows code as it was before.
>
> I doubt plain MSDN will give any useful answers; it doesn't even mention
> FCB's that I could find. The WDK might. Once upon a time there was a
> filesystem development kit that did give the answers to such questions;
> but it was only available for a humongous fee.
>
> -- Brane
>
>
Hello all,
The following links might contain the sought answers:
The Old New Thing:
"We're currently using FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH,
but we would like our WriteFile to go even faster"
http://blogs.msdn.com/b/oldnewthing/archive/2014/03/06/10505524.aspx

"Flushing your performance down the drain, that is"
http://blogs.msdn.com/b/oldnewthing/archive/2010/09/09/10059575.aspx

"If NTFS is a robust journaling file system, why do you have to be careful
when using it with a USB thumb drive?"
http://blogs.msdn.com/b/oldnewthing/archive/2013/01/01/10381556.aspx

Installable File System Drivers:
https://msdn.microsoft.com/en-us/library/windows/hardware/dn641617(v=vs.85).aspx

Daniel Klima

Re: Efficient and effective fsync during commit

Posted by Branko Čibej <br...@wandisco.com>.
On 29.05.2015 18:23, Ivan Zhakov wrote:
> On 29 May 2015 at 18:55, Stefan Fuhrmann <st...@wandisco.com> wrote:
>> You might be right. So, if you care about repository
>> integrity, you should use your MSDN subscription and
>> ask MS for clarification on FlushFileBuffers() behaviour.
> You also may request MSDN subscription and ask MS for clarification or
> keep Windows code as it was before.

I doubt plain MSDN will give any useful answers; it doesn't even mention
FCB's that I could find. The WDK might. Once upon a time there was a
filesystem development kit that did give the answers to such questions;
but it was only available for a humongous fee.

-- Brane


Re: Efficient and effective fsync during commit

Posted by Stefan Sperling <st...@elego.de>.
On Tue, Jun 16, 2015 at 11:01:36PM +0200, Stefan Fuhrmann wrote:
> I feel this is something we need to talk about privately in Berlin.

I'll make very sure indeed you two get to share a double bedroom at the hotel.
Promise.

Re: Efficient and effective fsync during commit

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Mon, Jun 15, 2015 at 5:36 PM, Ivan Zhakov <iv...@visualsvn.com> wrote:

> On 12 June 2015 at 15:11, Stefan Fuhrmann <st...@wandisco.com>
> wrote:
> > On Fri, May 29, 2015 at 6:23 PM, Ivan Zhakov <iv...@visualsvn.com> wrote:
> >>
> >> On 29 May 2015 at 18:55, Stefan Fuhrmann <st...@wandisco.com>
> >> wrote:
> >> > If you assume / suspect that FlushFileBuffers() only
> >> > operates on the open handle, i.e. only flushes those
> >> > changes made through that handle, then you assume
> >> > that our commit process is seriously broken:
> >> >
> >> > For every PUT, we open the protorev file, append the
> >> > respective txdelta and close the file again. Since the
> >> > final flush uses yet another handle, this implies that
> >> > most of the revision data in each rev file does not get
> >> > fsync'ed and may be lost upon power failure.
> >> >
> >> > You might be right. So, if you care about repository
> >> > integrity, you should use your MSDN subscription and
> >> > ask MS for clarification on FlushFileBuffers() behaviour.
> >> You also may request MSDN subscription and ask MS for clarification or
> >> keep Windows code as it was before.
> >
> >
> > To be clear: You are proposing that the code on Windows
> > is fundamentally broken (revision contents not being
> > committed) while I think we "only" have a persistence
> > issue with renames. Since your business depends on
> > you being wrong, it would be in your best self-interest
> > to go and find out ...
> >
> > Of course, I could apply for an MSDN subscription, wait
> > for it be approved etc. but I think it would be fairer if you
> > could check the Windows side of things while I try to get
> > some answers for POSIX.
> >
> Do I understand you correctly, that *your business* does not depend on
> Windows and you just do not care about this? You are wrong if you
> think this way. Please try to imagine what would happen if a Windows
> developer broke or slowed down the code on Unix and then just asked the
> Unix people to fix their problem, because it is their business.
>
> Please correct me if I have misread your position.
>

It boils down to 3 points:

Burden of proof. If someone makes a claim, it is their
responsibility to provide evidence. My claim is "we need
to fsync after rename". You made an independent claim
"fsync after close does not flush all contents". We both
have to provide evidence (or say that we can't right now,
or kindly ask others to help, or ...)

Cooperation. I spent a few days researching both
issues and came up with evidence (arguments can be
evidence as well) in support of my claim and counter
to your claim. So, I did work for *both* of us. It is perfectly
fine to not be convinced by the evidence and e.g.
require confirmation by a "higher authority". Then I
provided you with input for the kind of information you
would need to ask MS to either prove or disprove *your*
claim as well as *mine*. Finally, I offered to handle
the POSIX side of both claims on my own and asked
you to file the incident at MS. To me, this looks like
a lot of input on my part and not like ignoring anybody
or any system.

Self-interest. This is the part that completely puzzles me.
I don't mean it as an allegation; it is just the part where I
simply don't get you. Like, at all. If *my* livelihood
depended on the quality of a certain product on a specific
platform, and I then suspected a major flaw in its
transaction handling, I would try to at least figure out
how bad things are. And I would want to know it asap.
If I didn't have the time to investigate right away or if
I thought that the issue isn't urgent, I might say so and
ask for people's help in the meantime.

The last point is the critical one, IMO. It prevents me
from understanding you and has led to needless fights
in the past. I feel this is something we need to talk
about privately in Berlin.

-- Stefan^2.

Re: Efficient and effective fsync during commit

Posted by Branko Čibej <br...@wandisco.com>.
On 15.06.2015 18:24, Mark Phippard wrote:
> On Mon, Jun 15, 2015 at 12:15 PM, Branko Čibej <brane@wandisco.com
> <ma...@wandisco.com>> wrote:
>
>     On 15.06.2015 17:36, Ivan Zhakov wrote:
>     > On 12 June 2015 at 15:11, Stefan Fuhrmann
>     <stefan.fuhrmann@wandisco.com
>     <ma...@wandisco.com>> wrote:
>     >> To be clear: You are proposing that the code on Windows
>     >> is fundamentally broken (revision contents not being
>     >> committed) while I think we "only" have a persistence
>     >> issue with renames. Since your business depends on
>     >> you being wrong, it would be in your best self-interest
>     >> to go and find out ...
>     >>
>     >> Of course, I could apply for an MSDN subscription, wait
>     >> for it be approved etc. but I think it would be fairer if you
>     >> could check the Windows side of things while I try to get
>     >> some answers for POSIX.
>     >>
>     > Do I understand you correctly, that *your business* does not
>     depend on
>     > Windows and you just do not care about this
>
>     Ahem. So ... this has gone somewhat off the straight and narrow. Let's
>     leave business and self-interest out of this (all parties) and look at
>     the actual problem instead.
>
>     We've always sort of assumed around here that whoever had the working
>     configuration/platform on hand would be more likely to be able to
>     verify
>     some platform-dependent edge case or other. Windows is decidedly a bit
>     of a special case because, traditionally, setting up a build and test
>     environment for Subversion has been horribly complicated (as I handily
>     reminded myself just the other day as I was setting up a VM to get a
>     Windows vote in for 1.9.0-rc2 ... and I'll happily admit part of
>     the blame).
>
>     Stefan, for the future, I do think it wouldn't hurt you to get
>     your MSDN
>     subscription and set up a build environment if you intend to make
>     platform-dependent changes that can't be verified otherwise.
>     That's just
>     common sense. As it's also common sense for Ivan to verify such
>     changes
>     instead of placing all the burden on you.
>
>
> Not to go too far off-topic, but is it even true that you still need
> MSDN?  I thought the compilers and build tools were available for free
> now?  There is even a free version of Visual Studio that is fully
> functional.
>
> https://www.visualstudio.com/en-us/products/visual-studio-community-vs.aspx
>
> Before this was available, I recall the SDK now includes the compilers
> and that is also freely available.

Versions of compilers and Visual Studio are available for free, yes. You
still have to build a Windows machine or VM; the MSDN subscription that
is available to all ASF committers gives you the ability to do that
legally and for free.

-- Brane

Re: Efficient and effective fsync during commit

Posted by Mark Phippard <ma...@gmail.com>.
On Mon, Jun 15, 2015 at 12:15 PM, Branko Čibej <br...@wandisco.com> wrote:

> On 15.06.2015 17:36, Ivan Zhakov wrote:
> > On 12 June 2015 at 15:11, Stefan Fuhrmann <st...@wandisco.com>
> wrote:
> >> To be clear: You are proposing that the code on Windows
> >> is fundamentally broken (revision contents not being
> >> committed) while I think we "only" have a persistence
> >> issue with renames. Since your business depends on
> >> you being wrong, it would be in your best self-interest
> >> to go and find out ...
> >>
> >> Of course, I could apply for an MSDN subscription, wait
> >> for it be approved etc. but I think it would be fairer if you
> >> could check the Windows side of things while I try to get
> >> some answers for POSIX.
> >>
> > Do I understand you correctly, that *your business* does not depend on
> > Windows and you just do not care about this
>
> Ahem. So ... this has gone somewhat off the straight and narrow. Let's
> leave business and self-interest out of this (all parties) and look at
> the actual problem instead.
>
> We've always sort of assumed around here that whoever had the working
> configuration/platform on hand would be more likely to be able to verify
> some platform-dependent edge case or other. Windows is decidedly a bit
> of a special case because, traditionally, setting up a build and test
> environment for Subversion has been horribly complicated (as I handily
> reminded myself just the other day as I was setting up a VM to get a
> Windows vote in for 1.9.0-rc2 ... and I'll happily admit part of the
> blame).
>
> Stefan, for the future, I do think it wouldn't hurt you to get your MSDN
> subscription and set up a build environment if you intend to make
> platform-dependent changes that can't be verified otherwise. That's just
> common sense. As it's also common sense for Ivan to verify such changes
> instead of placing all the burden on you.
>
>
Not to go too far off-topic, but is it even true that you still need MSDN?
I thought the compilers and build tools were available for free now?  There
is even a free version of Visual Studio that is fully functional.

https://www.visualstudio.com/en-us/products/visual-studio-community-vs.aspx

Before this was available, I recall the SDK now includes the compilers and
that is also freely available.

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: Efficient and effective fsync during commit

Posted by Branko Čibej <br...@wandisco.com>.
On 15.06.2015 17:36, Ivan Zhakov wrote:
> On 12 June 2015 at 15:11, Stefan Fuhrmann <st...@wandisco.com> wrote:
>> To be clear: You are proposing that the code on Windows
>> is fundamentally broken (revision contents not being
>> committed) while I think we "only" have a persistence
>> issue with renames. Since your business depends on
>> you being wrong, it would be in your best self-interest
>> to go and find out ...
>>
>> Of course, I could apply for an MSDN subscription, wait
>> for it be approved etc. but I think it would be fairer if you
>> could check the Windows side of things while I try to get
>> some answers for POSIX.
>>
> Do I understand you correctly, that *your business* does not depend on
> Windows and you just do not care about this

Ahem. So ... this has gone somewhat off the straight and narrow. Let's
leave business and self-interest out of this (all parties) and look at
the actual problem instead.

We've always sort of assumed around here that whoever had the working
configuration/platform on hand would be more likely to be able to verify
some platform-dependent edge case or other. Windows is decidedly a bit
of a special case because, traditionally, setting up a build and test
environment for Subversion has been horribly complicated (as I handily
reminded myself just the other day as I was setting up a VM to get a
Windows vote in for 1.9.0-rc2 ... and I'll happily admit part of the blame).

Stefan, for the future, I do think it wouldn't hurt you to get your MSDN
subscription and set up a build environment if you intend to make
platform-dependent changes that can't be verified otherwise. That's just
common sense. As it's also common sense for Ivan to verify such changes
instead of placing all the burden on you.

We're supposed to be trying to work together towards a common goal,
right, not pass the hot potato around? :)

-- Brane

Re: Efficient and effective fsync during commit

Posted by Ivan Zhakov <iv...@visualsvn.com>.
On 12 June 2015 at 15:11, Stefan Fuhrmann <st...@wandisco.com> wrote:
> On Fri, May 29, 2015 at 6:23 PM, Ivan Zhakov <iv...@visualsvn.com> wrote:
>>
>> On 29 May 2015 at 18:55, Stefan Fuhrmann <st...@wandisco.com>
>> wrote:
>> > On Fri, May 29, 2015 at 4:14 PM, Ivan Zhakov <iv...@visualsvn.com> wrote:
>> >> On 28 May 2015 at 20:47, Stefan Fuhrmann <st...@wandisco.com>
>> >> wrote:
>> >>> Hi all,
>> >>>
> >> >>> Most of us would agree that the way we fsync FS changes
>> >>> in FSFS and FSX is slow (~10 commits / sec on a SSD,
>> >>> YMMV) and not even all changes are fully fsync'ed
>> >>> (repo creation, upgrade).
>> >>>
>> >> The first question: is it really a problem?
>> >
>> > Recently, we had customers wondering why their servers
>> > wouldn't serve more than 20 commits/s (even on enterprise
>> > SSDs and with various OS file system tuning options).
>> > With QA bots constantly creating snapshots and tags,
>> > there isn't too much head room anymore.
>> >
>> Ack. Was it Windows or Linux?
>
>
[...]

>> > If you assume / suspect that FlushFileBuffers() only
>> > operates on the open handle, i.e. only flushes those
>> > changes made through that handle, then you assume
>> > that our commit process is seriously broken:
>> >
>> > For every PUT, we open the protorev file, append the
>> > respective txdelta and close the file again. Since the
>> > final flush uses yet another handle, this implies that
>> > most of the revision data in each rev file does not get
>> > fsync'ed and may be lost upon power failure.
>> >
>> > You might be right. So, if you care about repository
>> > integrity, you should use your MSDN subscription and
>> > ask MS for clarification on FlushFileBuffers() behaviour.
>> You also may request MSDN subscription and ask MS for clarification or
>> keep Windows code as it was before.
>
>
> To be clear: You are proposing that the code on Windows
> is fundamentally broken (revision contents not being
> committed) while I think we "only" have a persistence
> issue with renames. Since your business depends on
> you being wrong, it would be in your best self-interest
> to go and find out ...
>
> Of course, I could apply for an MSDN subscription, wait
> for it be approved etc. but I think it would be fairer if you
> could check the Windows side of things while I try to get
> some answers for POSIX.
>
Do I understand you correctly, that *your business* does not depend on
Windows and you just do not care about this? You are wrong if you
think this way. Please try to imagine what would happen if a Windows
developer broke or slowed down the code on Unix and then just asked the
Unix people to fix their problem, because it is their business.

Please correct me if I have misread your position.

[...]


-- 
Ivan Zhakov

Re: Efficient and effective fsync during commit

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Fri, May 29, 2015 at 6:23 PM, Ivan Zhakov <iv...@visualsvn.com> wrote:

> On 29 May 2015 at 18:55, Stefan Fuhrmann <st...@wandisco.com>
> wrote:
> > On Fri, May 29, 2015 at 4:14 PM, Ivan Zhakov <iv...@visualsvn.com> wrote:
> >> On 28 May 2015 at 20:47, Stefan Fuhrmann <st...@wandisco.com>
> wrote:
> >>> Hi all,
> >>>
> >>> Most of us would agree that the way we fsync FS changes
> >>> in FSFS and FSX is slow (~10 commits / sec on a SSD,
> >>> YMMV) and not even all changes are fully fsync'ed
> >>> (repo creation, upgrade).
> >>>
> >> The first question: is it really a problem?
> >
> > Recently, we had customers wondering why their servers
> > wouldn't serve more than 20 commits/s (even on enterprise
> > SSDs and with various OS file system tuning options).
> > With QA bots constantly creating snapshots and tags,
> > there isn't too much head room anymore.
> >
> Ack. Was it Windows or Linux?
>

I do not know. My guess would be some Unix.


> >> I mean that usually commits
> >> are not that often. There are maintenance tasks like 'svnadmin load'
> >> that perform commits very often, but those could be handled with a
> >> '--fsfs-no-sync' option to 'svnadmin load' like we had for BDB.
> >
> > That would be a workable approach. Adding a bunch
> > of if statements.
> Something like 5-15..
>

Sure. It's not impossible to do but not exactly a
local change either. I'd rather centralize the fsync
management before adding switches to it.


> > It would not help for svnsync, though.
> Agree. We could expose this option in fsfs.conf, like we had in BDB,
> but it would be dangerous for production use even with properly
> power-loss-protected hardware.
>

It's hard to say where to put that option. Could be
per operation or per storage / repo. The latter implies
that a server config option would also be a possibility.

What we would ideally want is something that is not
even generally available, IIC: Flush cache buffers
but not the HW buffers. Some overhead but little added
latency and relying on battery-backed storage.

>>> From a high-level perspective, a commit is a simple
> >>> 3-step process:
> >>>
> >>> 1. Write rev contents & props to their final location.
> >>>    They are not accessible until 'current' gets bumped.
> >>>    Write the new 'current' contents to a temp file.
> >>> 2. Fsync everything we wrote in step 1 to disk.
> >>>    Still not visible to any other FS API user.
> >>> 3. Atomically switch to the new 'current' contents and
> >>>    fsync that change.
> >>>
> >>> Today, we fsync "locally" as part of whatever copy or
> >>> move operation we need. That is inefficient because
> >>> on POSIX platforms, we tend to fsync the same folder
> >>> more than once, on Windows we would need to sync
> >>> the files twice (content before and metadata after the
> >>> rename). Basically, missing the context info, we need
> >>> to play it safe and do much more work than we would
> >>> actually have to.
> >>>
> >>> In the future, we should implement step 1 as simple
> >>> non-fsync'ing file operations. Then explicitly sync every
> >>> file, and on POSIX the folders, once. Step 2 does not
> >>> have any atomicity requirements. Finally, do the 'current'
> >>> rename. This also only requires a single fsync b/c
> >>> the temp file will be in the same folder.
> >>>
> >>> On top of that, all operations in step 2 can be run
> >>> concurrently. I did that for FSX on Linux using aio_fsync
> >>> and it got 3x as fast. Windows can do something similar.
> >>> I wrapped that functionality into a "batch_fsync" object
> >>> with a few methods on it. You simply push paths into it,
> >>> it drops duplicates, and finally you ask it to fsync all.
> >>>
> >> I didn't find any documentation that calling FlushFileBuffers() on one
> >> handle flushes changes (data and metadata) made using other handle.
> >> I'm -1 to rely on this without official documentation proof. At least
> >> for FSFS.
> >
> > If you assume / suspect that FlushFileBuffers() only
> > operates on the open handle, i.e. only flushes those
> > changes made through that handle, then you assume
> > that our commit process is seriously broken:
> >
> > For every PUT, we open the protorev file, append the
> > respective txdelta and close the file again. Since the
> > final flush uses yet another handle, this implies that
> > most of the revision data in each rev file does not get
> > fsync'ed and may be lost upon power failure.
> >
> > You might be right. So, if you care about repository
> > integrity, you should use your MSDN subscription and
> > ask MS for clarification on FlushFileBuffers() behaviour.
> You also may request MSDN subscription and ask MS for clarification or
> keep Windows code as it was before.
>

To be clear: You are proposing that the code on Windows
is fundamentally broken (revision contents not being
committed) while I think we "only" have a persistence
issue with renames. Since your business depends on
you being wrong, it would be in your best self-interest
to go and find out ...

Of course, I could apply for an MSDN subscription, wait
for it be approved etc. but I think it would be fairer if you
could check the Windows side of things while I try to get
some answers for POSIX.


> > Things we would like to know:
> >
> > * Does FlushFileBuffers() also flush changes made to
> >   the same file through different handles? For simplification
> >   we may assume those other handles got closed and
> >   were owned by the same process.
> >
>
>
> > * Is calling FlushFileBuffers() on the target of a rename /
> >   move sufficient to flush all metadata? Does it also
> >   flush outstanding file content changes?
> Calling FlushFileBuffers() on the target is not sufficient due to the
> problem I described before [1]: metadata changes are journaled, while
> data changes seemingly are not. So you may get a race condition where
> the move operation is recorded in the journal while the new file
> content is not yet written to disk. On system restart the journal will
> be replayed, leaving an empty or old content file in place. That's why
> the source file should be flushed to disk before the move operation.
>

That's all well and good, but when will the metadata be flushed?

Also, we only need to be aware of that content / metadata race
for the "big switch-over" that e.g. makes the next revision
visible to all. Until then, we only care about whether all changes
have been written to disk or not. If they haven't, we don't
care about specifics because nobody will read the partially
written data.

> * Is there a way to efficiently flush multiple files, e.g.
> >   through something like overlapped I/O?
> >
> > * Does passing the FILE_FLAG_WRITE_THROUGH and
> >   FILE_FLAG_NO_BUFFERING flags to CreateFile()
> >   guarantee that all contents has been stored on disk
> >   when CloseHandle() returns? (Assuming the HW does
> >   not lie about its write buffers).
> >
> FILE_FLAG_NO_BUFFERING is not related to disk caching: it disables
> file buffering and requires the caller to perform only cluster-aligned
> operations [2]
>

The source you are citing contradicts you:

  "For more information on how *FILE_FLAG_NO_BUFFERING*
  interacts with other cache-related flags, see *CreateFile*
  <https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858>."

This flag does switch OS-side caching on and off.

Buffering *is* caching in Windows. For some unknown reason,
they called it "buffers" but then had to come up with some
term describing the entirety of buffers, aka "cache".


> With the FILE_FLAG_WRITE_THROUGH flag the disk cache is not used at
> all, i.e. changes go directly to hardware with a special bit set to
> skip the internal HW cache.


Yes. This controls the HW-side caching (but can be
overruled by global settings).


> Nothing is flushed to disk when CloseHandle() returns in this
> case.
>

I only added the "CloseHandle" part because that is what
we ultimately care about. Before that, we don't need data
to be persistent. So, if there was some magic to be applied,
I wanted to give as much leverage as possible.

> > Disclaimer: My understanding of the fsync behaviour
> > on Windows is based on conjecture, gathered from the
> > few pieces of information that I could find online. I'm
> > happy to change my mind once new evidence shows
> > up. Right now, our implementation seems to be wasteful
> > and possibly incomplete - which is worse. I would love
> > to fix both for 1.10.
> >
>
> [1] http://svn.haxx.se/dev/archive-2013-05/0245.shtml
> [2] https://msdn.microsoft.com/en-us/library/windows/desktop/cc644950
>
>
I've been thinking about how to implement a batch
fsync feature with few platform specifics and next
to no overhead. I'll try to get the first bits committed
to FSX over the weekend.

-- Stefan^2.

Re: Efficient and effective fsync during commit

Posted by Ivan Zhakov <iv...@visualsvn.com>.
On 29 May 2015 at 18:55, Stefan Fuhrmann <st...@wandisco.com> wrote:
> On Fri, May 29, 2015 at 4:14 PM, Ivan Zhakov <iv...@visualsvn.com> wrote:
>> On 28 May 2015 at 20:47, Stefan Fuhrmann <st...@wandisco.com> wrote:
>>> Hi all,
>>>
>>> Most of us would agree that the way we fsync FS changes
>>> in FSFS and FSX is slow (~10 commits / sec on a SSD,
>>> YMMV) and not even all changes are fully fsync'ed
>>> (repo creation, upgrade).
>>>
>> The first question: is it really a problem?
>
> Recently, we had customers wondering why their servers
> wouldn't serve more than 20 commits/s (even on enterprise
> SSDs and with various OS file system tuning options).
> With QA bots constantly creating snapshots and tags,
> there isn't too much head room anymore.
>
Ack. Was it Windows or Linux?

>> I mean that usually commits
>> are not that often. There are maintenance tasks like 'svnadmin load'
>> that perform commits very often, but those could be handled with a
>> '--fsfs-no-sync' option to 'svnadmin load' like we had for BDB.
>
> That would be a workable approach. Adding a bunch
> of if statements.
Something like 5-15..

> It would not help for svnsync, though.
Agree. We could expose this option in fsfs.conf, like we had in BDB,
but it would be dangerous for production use even with properly
power-loss-protected hardware.

>>> From a high-level perspective, a commit is a simple
>>> 3-step process:
>>>
>>> 1. Write rev contents & props to their final location.
>>>    They are not accessible until 'current' gets bumped.
>>>    Write the new 'current' contents to a temp file.
>>> 2. Fsync everything we wrote in step 1 to disk.
>>>    Still not visible to any other FS API user.
>>> 3. Atomically switch to the new 'current' contents and
>>>    fsync that change.
>>>
>>> Today, we fsync "locally" as part of whatever copy or
>>> move operation we need. That is inefficient because
>>> on POSIX platforms, we tend to fsync the same folder
>>> more than once, on Windows we would need to sync
>>> the files twice (content before and metadata after the
>>> rename). Basically, missing the context info, we need
>>> to play it safe and do much more work than we would
>>> actually have to.
>>>
>>> In the future, we should implement step 1 as simple
>>> non-fsync'ing file operations. Then explicitly sync every
>>> file, and on POSIX the folders, once. Step 2 does not
>>> have any atomicity requirements. Finally, do the 'current'
>>> rename. This also only requires a single fsync b/c
>>> the temp file will be in the same folder.
>>>
>>> On top of that, all operations in step 2 can be run
>>> concurrently. I did that for FSX on Linux using aio_fsync
>>> and it got 3x as fast. Windows can do something similar.
>>> I wrapped that functionality into a "batch_fsync" object
>>> with a few methods on it. You simply push paths into it,
>>> it drops duplicates, and finally you ask it to fsync all.
>>>
>> I didn't find any documentation that calling FlushFileBuffers() on one
>> handle flushes changes (data and metadata) made using other handle.
>> I'm -1 to rely on this without official documentation proof. At least
>> for FSFS.
>
> If you assume / suspect that FlushFileBuffers() only
> operates on the open handle, i.e. only flushes those
> changes made through that handle, then you assume
> that our commit process is seriously broken:
>
> For every PUT, we open the protorev file, append the
> respective txdelta and close the file again. Since the
> final flush uses yet another handle, this implies that
> most of the revision data in each rev file does not get
> fsync'ed and may be lost upon power failure.
>
> You might be right. So, if you care about repository
> integrity, you should use your MSDN subscription and
> ask MS for clarification on FlushFileBuffers() behaviour.
You may also request an MSDN subscription and ask MS for clarification,
or keep the Windows code as it was before.

> Things we would like to know:
>
> * Does FlushFileBuffers() also flush changes made to
>   the same file through different handles? For simplification
>   we may assume those other handles got closed and
>   were owned by the same process.
>


> * Is calling FlushFileBuffers() on the target of a rename /
>   move sufficient to flush all metadata? Does it also
>   flush outstanding file content changes?
Calling FlushFileBuffers() on the target is not sufficient due to the
problem I described before [1]: metadata changes are journaled, while
data changes apparently are not. So you may run into a race condition
where the move operation is recorded in the journal while the new file
content is not yet written to disk. On system restart the journal will
be replayed, leaving an empty or stale file in place. That's why the
source file should be flushed to disk before the move operation.

>
> * Is there a way to efficiently flush multiple files, e.g.
>   through something like overlapped I/O?
>
> * Does passing the FILE_FLAG_WRITE_THROUGH and
>   FILE_FLAG_NO_BUFFERING flags to CreateFile()
>   guarantee that all content has been stored on disk
>   when CloseHandle() returns? (Assuming the HW does
>   not lie about its write buffers).
>
FILE_FLAG_NO_BUFFERING is not related to disk caching: it disables
file buffering and requires the caller to perform only sector-aligned
operations [2].

With the FILE_FLAG_WRITE_THROUGH flag, the disk cache is not used at
all, i.e. changes go directly to the hardware with a special bit set to
skip the internal HW cache. Nothing is flushed to disk when
CloseHandle() returns in this case.

> Disclaimer: My understanding of the fsync behaviour
> on Windows is based on conjecture, gathered from the
> few pieces of information that I could find online. I'm
> happy to change my mind once new evidence shows
> up. Right now, our implementation seems to be wasteful
> and possibly incomplete - which is worse. I would love
> to fix both for 1.10.
>

[1] http://svn.haxx.se/dev/archive-2013-05/0245.shtml
[2] https://msdn.microsoft.com/en-us/library/windows/desktop/cc644950

-- 
Ivan Zhakov
CTO | VisualSVN | http://www.visualsvn.com

Re: Efficient and effective fsync during commit

Posted by Branko Čibej <br...@wandisco.com>.
On 29.05.2015 17:55, Stefan Fuhrmann wrote:
> If you assume / suspect that FlushFileBuffers() only operates on the
> open handle, i.e. only flushes those changes made through that handle,

From my dabbling with the Windows I/O stack and filesystems way back,
I'd say that flushing (and all other operations, really) is per-FCB.
The FCB (file control block) is a per-open-file unique structure deep in
the I/O stack that all file handles refer to. Any file handle that has
the necessary access and sharing rights to flush the file cache will
affect the cache state for all other file handles.

-- Brane

Re: Efficient and effective fsync during commit

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Fri, May 29, 2015 at 4:14 PM, Ivan Zhakov <iv...@visualsvn.com> wrote:
> On 28 May 2015 at 20:47, Stefan Fuhrmann <st...@wandisco.com> wrote:
>> Hi all,
>>
>> Most of us would agree that the way we fsync FS changes
>> in FSFS and FSX is slow (~10 commits / sec on a SSD,
>> YMMV) and not even all changes are fully fsync'ed
>> (repo creation, upgrade).
>>
> The first question: is it really a problem?

Recently, we had customers wondering why their servers
wouldn't serve more than 20 commits/s (even on enterprise
SSDs and with various OS file system tuning options).
With QA bots constantly creating snapshots and tags,
there isn't too much head room anymore.

> I mean that usually commits
> are not that frequent. There are maintenance tasks like 'svnadmin load'
> that perform commits very often, but it could be fixed with
> '--fsfs-no-sync' option to 'svnadmin load' like we had for BDB.

That would be a workable approach, adding a bunch
of if statements. It would not help svnsync, though.

>> From a high-level perspective, a commit is a simple
>> 3-step process:
>>
>> 1. Write rev contents & props to their final location.
>>    They are not accessible until 'current' gets bumped.
>>    Write the new 'current' contents to a temp file.
>> 2. Fsync everything we wrote in step 1 to disk.
>>    Still not visible to any other FS API user.
>> 3. Atomically switch to the new 'current' contents and
>>    fsync that change.
>>
>> Today, we fsync "locally" as part of whatever copy or
>> move operation we need. That is inefficient because
>> on POSIX platforms, we tend to fsync the same folder
>> more than once, on Windows we would need to sync
>> the files twice (content before and metadata after the
>> rename). Basically, missing the context info, we need
>> to play it safe and do much more work than we would
>> actually have to.
>>
>> In the future, we should implement step 1 as simple
>> non-fsync'ing file operations. Then explicitly sync every
>> file, and on POSIX the folders, once. Step 2 does not
>> have any atomicity requirements. Finally, do the 'current'
>> rename. This also only requires a single fsync b/c
>> the temp file will be in the same folder.
>>
>> On top of that, all operations in step 2 can be run
>> concurrently. I did that for FSX on Linux using aio_fsync
>> and it got 3x as fast. Windows can do something similar.
>> I wrapped that functionality into a "batch_fsync" object
>> with a few methods on it. You simply push paths into it,
>> it drops duplicates, and finally you ask it to fsync all.
>>
> I didn't find any documentation that calling FlushFileBuffers() on one
> handle flushes changes (data and metadata) made using another handle.
> I'm -1 on relying on this without official documentation proof. At
> least for FSFS.

If you assume / suspect that FlushFileBuffers() only
operates on the open handle, i.e. only flushes those
changes made through that handle, then you assume
that our commit process is seriously broken:

For every PUT, we open the protorev file, append the
respective txdelta and close the file again. Since the
final flush uses yet another handle, this implies that
most of the revision data in each rev file does not get
fsync'ed and may be lost upon power failure.

You might be right. So, if you care about repository
integrity, you should use your MSDN subscription and
ask MS for clarification on FlushFileBuffers() behaviour.
Things we would like to know:

* Does FlushFileBuffers() also flush changes made to
  the same file through different handles? For simplification
  we may assume those other handles got closed and
  were owned by the same process.

* Is calling FlushFileBuffers() on the target of a rename /
  move sufficient to flush all metadata? Does it also
  flush outstanding file content changes?

* Is there a way to efficiently flush multiple files, e.g.
  through something like overlapped I/O?

* Does passing the FILE_FLAG_WRITE_THROUGH and
  FILE_FLAG_NO_BUFFERING flags to CreateFile()
  guarantee that all content has been stored on disk
  when CloseHandle() returns? (Assuming the HW does
  not lie about its write buffers).

Disclaimer: My understanding of the fsync behaviour
on Windows is based on conjecture, gathered from the
few pieces of information that I could find online. I'm
happy to change my mind once new evidence shows
up. Right now, our implementation seems to be wasteful
and possibly incomplete - which is worse. I would love
to fix both for 1.10.

-- Stefan^2.

Re: Efficient and effective fsync during commit

Posted by Ivan Zhakov <iv...@visualsvn.com>.
On 28 May 2015 at 20:47, Stefan Fuhrmann <st...@wandisco.com> wrote:
> Hi all,
>
> Most of us would agree that the way we fsync FS changes
> in FSFS and FSX is slow (~10 commits / sec on a SSD,
> YMMV) and not even all changes are fully fsync'ed
> (repo creation, upgrade).
>
The first question: is it really a problem? I mean that commits usually
are not that frequent. There are maintenance tasks like 'svnadmin load'
that perform commits very often, but that could be fixed with a
'--fsfs-no-sync' option to 'svnadmin load' like we had for BDB.

> From a high-level perspective, a commit is a simple
> 3-step process:
>
> 1. Write rev contents & props to their final location.
>    They are not accessible until 'current' gets bumped.
>    Write the new 'current' contents to a temp file.
> 2. Fsync everything we wrote in step 1 to disk.
>    Still not visible to any other FS API user.
> 3. Atomically switch to the new 'current' contents and
>    fsync that change.
>
> Today, we fsync "locally" as part of whatever copy or
> move operation we need. That is inefficient because
> on POSIX platforms, we tend to fsync the same folder
> more than once, on Windows we would need to sync
> the files twice (content before and metadata after the
> rename). Basically, missing the context info, we need
> to play it safe and do much more work than we would
> actually have to.
>
> In the future, we should implement step 1 as simple
> non-fsync'ing file operations. Then explicitly sync every
> file, and on POSIX the folders, once. Step 2 does not
> have any atomicity requirements. Finally, do the 'current'
> rename. This also only requires a single fsync b/c
> the temp file will be in the same folder.
>
> On top of that, all operations in step 2 can be run
> concurrently. I did that for FSX on Linux using aio_fsync
> and it got 3x as fast. Windows can do something similar.
> I wrapped that functionality into a "batch_fsync" object
> with a few methods on it. You simply push paths into it,
> it drops duplicates, and finally you ask it to fsync all.
>
I didn't find any documentation that calling FlushFileBuffers() on one
handle flushes changes (data and metadata) made using another handle.
I'm -1 on relying on this without official documentation proof. At
least for FSFS.



-- 
Ivan Zhakov

Re: Efficient and effective fsync during commit

Posted by Philip Martin <ph...@wandisco.com>.
Stefan Fuhrmann <st...@wandisco.com> writes:

> On Thu, May 28, 2015 at 9:54 PM, Philip Martin
> <ph...@wandisco.com> wrote:
>>
>> fsync() works on file descriptors rather than files, do we need to keep
>> the original file descriptors open in order to fsync()?
>
> We could b/c there are at most 7 (4 files, 3 folders) of them for a
> FSFS commit, but this is not necessary. Since it would imply
> keeping them open during renames, we could no longer use
> plain APR calls - i.e. extra code churn.
>
> If your interpretation was correct, fsync'ing a directory would
> only work if you modified that directory file through its descriptor -
> which you simply can't. Also, it would mean that our protorev
> file handling was broken: We open & close that file for every
> PUT, re-open it during commit, append the structure data and
> fsync only through the last file handle.

It's not my interpretation as such; I just want us to be clear about
the assumptions we would be making.

I suppose it is possible that our protorev handling is broken on some
filesystems.  It is also possible that some filesystems handle
directories and files in totally different ways: some sort of COW tree
for directories and a list of blocks for files.  Using the behaviour of
fsync on directories is not necessarily a good way to predict the
behaviour of fsync on files.  There is no mention of directories in the
POSIX description of fsync, unlike that of open.

If we consider a directory fsync after a rename, then there is more to do
than just identifying which disk blocks store the directory; the rename
may have affected two directories.  When a rename affects two
directories, if an fsync on one is to flush the other, then the filesystem
must either do a complete metadata flush or store some sort of pointer
to the other directory.  I don't think any of this is a problem for our
current Subversion code, but it does illustrate that directory fsync is
not necessarily a model for file fsync.

-- 
Philip Martin | Subversion Committer
WANdisco // *Non-Stop Data*

Re: Efficient and effective fsync during commit

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Thu, May 28, 2015 at 9:54 PM, Philip Martin
<ph...@wandisco.com> wrote:
> Stefan Fuhrmann <st...@wandisco.com> writes:
>
>> In the future, we should implement step 1 as simple
>> non-fsync'ing file operations. Then explicitly sync every
>> file, and on POSIX the folders, once. Step 2 does not
>> have any atomicity requirements. Finally, do the 'current'
>> rename. This also only requires a single fsync b/c
>> the temp file will be in the same folder.
>>
>> On top of that, all operations in step 2 can be run
>> concurrently. I did that for FSX on Linux using aio_fsync
>> and it got 3x as fast. Windows can do something similar.
>> I wrapped that functionality into a "batch_fsync" object
>> with a few methods on it. You simply push paths into it,
>> it drops duplicates, and finally you ask it to fsync all.
>
> fsync() works on file descriptors rather than files, do we need to keep
> the original file descriptors open in order to fsync()?

We could, because there are at most 7 of them (4 files, 3 folders) for
a FSFS commit, but this is not necessary. Since it would imply keeping
them open during renames, we could no longer use plain APR calls,
i.e. extra code churn.

If your interpretation was correct, fsync'ing a directory would
only work if you modified that directory file through its descriptor -
which you simply can't. Also, it would mean that our protorev
file handling was broken: We open & close that file for every
PUT, re-open it during commit, append the structure data and
fsync only through the last file handle.

> The POSIX description is
>
> http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html
>
>   The fsync() function shall request that all data for the open file
>   descriptor named by fildes is to be transferred to the storage device
>   associated with the file described by fildes.  The nature of the
>   transfer is implementation-defined.

My reading of this paragraph is: "all data for the open file descriptor"
means "all data accessible through the open file descriptor", not
just the parts that it might have modified itself. The following
paragraph for serialized I/O makes this clearer:

> If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function
> shall force all currently queued I/O operations associated with the
> file indicated by file descriptor fildes to the synchronized I/O completion
> state. ...

So, it requires the file object to be flushed, not just some subset
manipulated through the given descriptor. If _POSIX_SYNCHRONIZED_IO
is not defined, the behaviour is configuration dependent and fsync
may simply be ineffective.

-- Stefan^2.

Re: Efficient and effective fsync during commit

Posted by Philip Martin <ph...@wandisco.com>.
Stefan Fuhrmann <st...@wandisco.com> writes:

> In the future, we should implement step 1 as simple
> non-fsync'ing file operations. Then explicitly sync every
> file, and on POSIX the folders, once. Step 2 does not
> have any atomicity requirements. Finally, do the 'current'
> rename. This also only requires a single fsync b/c
> the temp file will be in the same folder.
>
> On top of that, all operations in step 2 can be run
> concurrently. I did that for FSX on Linux using aio_fsync
> and it got 3x as fast. Windows can do something similar.
> I wrapped that functionality into a "batch_fsync" object
> with a few methods on it. You simply push paths into it,
> it drops duplicates, and finally you ask it to fsync all.

fsync() works on file descriptors rather than files, do we need to keep
the original file descriptors open in order to fsync()?  The POSIX
description is

http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html

  The fsync() function shall request that all data for the open file
  descriptor named by fildes is to be transferred to the storage device
  associated with the file described by fildes.  The nature of the
  transfer is implementation-defined.

It probably is feasible to keep the file descriptors open, provided we
don't accumulate too many.

-- 
Philip Martin | Subversion Committer
WANdisco // *Non-Stop Data*