Posted to dev@subversion.apache.org by Ivan Zhakov <iv...@visualsvn.com> on 2011/02/20 09:50:09 UTC

Re: FSFS format 6

On Wed, Dec 29, 2010 at 22:37, Stefan Fuhrmann <eq...@web.de> wrote:
> The fopen() calls should be eliminated by the
> file handle cache. IOW, they should already be
> addressed on the performance branch. Please
> let me know if that is not the case.
>
Just my 20 cents.

My belief is that a file handle cache should be implemented at the OS level,
and I'm pretty sure that it is. And the right way to eliminate the
duplicate fopen()/read() calls is to improve our FS API.

I haven't reviewed how the file handle cache is implemented in the
fs-performance branch, but I'm close to -1 on implementing a cache
of open file handles in Subversion.

-- 
Ivan Zhakov

Re: FSFS format 6

Posted by Stefan Fuhrmann <eq...@web.de>.
On 20.02.2011 09:50, Ivan Zhakov wrote:
> On Wed, Dec 29, 2010 at 22:37, Stefan Fuhrmann<eq...@web.de>  wrote:
>> The fopen() calls should be eliminated by the
>> file handle cache. IOW, they should already be
>> addressed on the performance branch. Please
>> let me know if that is not the case.
>>
> Just my 20 cents.
High roller.
> My belief is that a file handle cache should be implemented at the OS level,
> and I'm pretty sure that it is.
You can certainly provide data to demonstrate your claim?

In fact, fopen() is extremely expensive (1..5ms) on FS with
ACLs. Even for a local, low overhead (EXT3) FS, the effect
of handle caching is significant:

time ./svnadmin verify $TSVN_MIRROR -q -F 256 -M 0
real   1m46.603s
user   1m43.474s
sys    0m3.132s

time ./svnadmin verify $TSVN_MIRROR -q -F 0 -M 0
real   2m26.664s
user   2m0.856s
sys    0m25.818s

Note that the gains are split about 50:50 between the OS
and the application. Things become even more interesting
albeit less easily demonstrable with concurrent queries
being run by a threaded server. One would expect an even
higher level of reuse.
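
To make the split concrete, the differences between the two runs can be worked out directly. This is only arithmetic on the timings quoted above. It assumes the -F 256 run is the one with handle caching enabled (the lower times suggest so), and the helper name to_seconds is made up for illustration.

```python
# Timings copied from the two `svnadmin verify` runs quoted above.
def to_seconds(stamp):
    """Convert a `time` stamp such as '1m46.603s' into seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)

with_cache = {"real": "1m46.603s", "user": "1m43.474s", "sys": "0m3.132s"}
without_cache = {"real": "2m26.664s", "user": "2m0.856s", "sys": "0m25.818s"}

# Seconds saved per category when the handle cache is on.
saved = {
    key: round(to_seconds(without_cache[key]) - to_seconds(with_cache[key]), 3)
    for key in with_cache
}

# Share of the saved CPU time that was spent inside the OS (sys).
cpu_saved = saved["user"] + saved["sys"]
sys_share = saved["sys"] / cpu_saved

print(saved, round(sys_share, 2))
```

That works out to roughly 40.1 s less wall-clock time, of which about 17.4 s is user time and about 22.7 s is system time, so the OS accounts for a bit more than half of the gain.
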
> And the right way to eliminate the
> duplicate fopen()/read() calls is to improve our FS API.
Why would that be necessary if the OS already takes care
of all the optimizations?

FSFS6 is about optimizing the interface between OS and
the FSFS code: Fewer seek()s and drastically reduced
number of read()s.

Once that is in place and its behavior well understood, we
may start designing I/O aggregation and scheduling. In
particular holding off requests while another request already
fetches the desired data, will be a very interesting task.
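
As a rough illustration of "holding off requests while another request already fetches the desired data", here is a minimal sketch of that idea in Python. It is not code from the performance branch; the class name SingleFlight, the fetch() method and the loader callback are all hypothetical.

```python
import threading

class SingleFlight:
    """Let one caller fetch a key while concurrent callers wait for its result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> (done event, result holder)

    def fetch(self, key, loader):
        with self._lock:
            entry = self._in_flight.get(key)
            leader = entry is None
            if leader:
                # First request for this key: register it so others can wait.
                entry = (threading.Event(), {})
                self._in_flight[key] = entry
        done, holder = entry
        if leader:
            try:
                holder["value"] = loader(key)
            except BaseException as exc:
                holder["error"] = exc
            finally:
                with self._lock:
                    del self._in_flight[key]
                done.set()  # wake up every request that was held off
        else:
            done.wait()  # another request is already fetching this key
        if "error" in holder:
            raise holder["error"]
        return holder["value"]
```

Requests that arrive after the fetch has finished start a new fetch; combining this with a data cache would be a separate design decision.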

From what I understood of the FS API, there is very little
that would need to be added to allow for effective I/O optimization.
Basically, a simple "advise" or "prefetch" option on the
read functions could possibly do the trick.
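
For illustration only, such a hint could look roughly like the following Python sketch. The function read_range() and its prefetch_length parameter are hypothetical and not part of the Subversion FS API. The sketch is POSIX-only (os.pread) and simply skips the hint where os.posix_fadvise is unavailable or refused.

```python
import os
import tempfile

def read_range(fd, offset, length, prefetch_length=0):
    """Read `length` bytes at `offset`; optionally hint that more will follow.

    `prefetch_length` is a hypothetical "advise" option. The hint is purely
    advisory, so failures to deliver it are ignored.
    """
    if prefetch_length > 0 and hasattr(os, "posix_fadvise"):
        try:
            # Tell the OS that the bytes right after this read are wanted soon.
            os.posix_fadvise(fd, offset + length, prefetch_length,
                             os.POSIX_FADV_WILLNEED)
        except OSError:
            pass  # hint not supported for this file; the read still works
    return os.pread(fd, length, offset)

# Small demonstration on a temporary file.
with tempfile.TemporaryFile() as handle:
    handle.write(b"header" + b"x" * 100)
    handle.flush()
    first = read_range(handle.fileno(), 0, 6, prefetch_length=100)
```

The point of the design is that the caller states what it expects to need next and the lower layer remains free to ignore it.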

If we get to that stage, I'm sure to receive "the OS should
take care of I/O scheduling and stuff" posts.
> I haven't reviewed how the file handle cache is implemented in the
> fs-performance branch, but I'm close to -1 on implementing a cache
> of open file handles in Subversion.
File handle caching definitely has its drawbacks, risks
in particular. The number of file handles within an OS
instance is quite limited (typ. 1000) and open files may
prevent file deletion (e.g. during packing). The code is
supposed to take care of the latter but may be faulty.

Alternative designs are welcome.
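
To make the trade-offs above easier to discuss, here is a minimal sketch of a bounded handle cache in Python. It is not the implementation on the performance branch. The class FileHandleCache and its methods are hypothetical, and the sketch is single-threaded. It shows the two risks mentioned: the capacity limit keeps the number of open handles bounded, and invalidate() has to be called before a file is deleted, e.g. during packing.

```python
from collections import OrderedDict

class FileHandleCache:
    """Keep at most `capacity` files open, closing the least recently used."""

    def __init__(self, capacity, opener=open):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._opener = opener
        self._handles = OrderedDict()  # path -> open file object
        self.opens = 0  # number of real open calls, for measurement

    def get(self, path):
        """Return an open handle for `path`, reusing a cached one if possible."""
        handle = self._handles.pop(path, None)
        if handle is None:
            handle = self._opener(path, "rb")
            self.opens += 1
        self._handles[path] = handle  # most recently used goes last
        while len(self._handles) > self._capacity:
            _, oldest = self._handles.popitem(last=False)
            oldest.close()  # stay below the OS handle limit
        return handle

    def invalidate(self, path):
        """Close a cached handle, e.g. before packing deletes the file."""
        handle = self._handles.pop(path, None)
        if handle is not None:
            handle.close()

    def close_all(self):
        """Close every cached handle."""
        while self._handles:
            _, handle = self._handles.popitem()
            handle.close()
```

Because handles are shared between calls, a caller has to seek to the position it needs before reading.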

-- Stefan^2.

Re: FSFS format 6

Posted by Stefan Fuhrmann <eq...@web.de>.
On 20.02.2011 21:02, Johan Corveleyn wrote:
> On Sun, Feb 20, 2011 at 6:35 PM, Mark Mielke<ma...@mark.mielke.cc>  wrote:
>
>> That said, I'm also (in principle) against implementing cache of open file
>> handles. I prefer architectures that cache intermediate data in a processed
>> form that the application has made a determined choice to make use of such
>> that the cache is the most useful to the application, rather than a
>> transparent caching layer that guesses at what is safe. The OS file system
>> layer is exactly this - any caching it does is transparent to the
>> application and a guess. Guesses are dangerous, which is exactly why the OS
>> file system layer cannot do as much caching unless it has 100% control of
>> the file system (= local file system).
Agreed. For that very reason, I added extensive
caching to the FSFS code and got even more of that
in the pipeline for 1.8.

That being said, there are still typical situations in
which the data cache may not be effective:

* access to relatively rarely read data
   (log, older tags;
    you still want to perform decently in that case)
* first access to the latest revision
   (due to the way transactions are implemented,
    it is difficult to fill all the caches upon write)
* amount of active data > available RAM
   (throws you back to the first issue more often)

> I agree that it would be best if the architecture was so that svn
> could organize its work for most use cases in a way that's efficient
> for the lower levels of the system. For instance, for "svn log", svn
> should in theory be able to do its work with exactly 1 open/close per
> rev file (or in a packed repository, maybe even only 1 open/close per
> packed file).
Yes, it may be very hard to anticipate what data may
be needed further down the road, even if we had a
marvelous "1 query gets it all" interface where feasible:
svn log, for instance, is often run with a limit on the number
of results. However, there is no way to tell how much of
a packed file needs to be read to process that query.
There is only a lower bound.

So, it can be very beneficial to keep a small number of
file handles around to "bridge" various stages / iterations
within a single request.
> But right now, this isn't the case, and I think it would be a huge
> amount of work, change in architecture, layering, ... Until that
> happens, I think such a generic file-handle caching layer could prove
> very helpful :-). Note though that, if I understood correctly, the
> file-handle caching of the performance branch will not be reintegrated
> into 1.7, but maybe 1.8 ...
>
> But maybe stefan2 can comment more on that :-).
Because keeping files open for a potentially much
longer period of time may have an impact on other,
rarely run operations like pack, I don't think we should
risk merging this into 1.7.

-- Stefan^2.

Re: FSFS format 6

Posted by Johan Corveleyn <jc...@gmail.com>.
On Sun, Feb 20, 2011 at 6:35 PM, Mark Mielke <ma...@mark.mielke.cc> wrote:
> On 02/20/2011 03:50 AM, Ivan Zhakov wrote:
>>
>> On Wed, Dec 29, 2010 at 22:37, Stefan Fuhrmann<eq...@web.de>  wrote:
>>>
>>> The fopen() calls should be eliminated by the
>>> file handle cache. IOW, they should already be
>>> addressed on the performance branch. Please
>>> let me know if that is not the case.
>>
>> My belief is that a file handle cache should be implemented at the OS level,
>> and I'm pretty sure that it is. And the right way to eliminate the
>> duplicate fopen()/read() calls is to improve our FS API.
>>
>> I haven't reviewed how the file handle cache is implemented in the
>> fs-performance branch, but I'm close to -1 on implementing a cache
>> of open file handles in Subversion.
>
> What OS implements file handle caching? The OS file system layer for most
> operating systems does implement caching - but open()/close() can easily
> invalidate some or all of this cache due to required POSIX behaviour,
> especially if the backend storage is remote and shared between multiple
> clients such as would be the case over NFS. This is required to implement
> consistency across clients. The local operating system cannot arbitrarily
> cache everything, and every bit of data it does decide to cache could be
> wrong at any point in time without other aspects in use such as file
> locking.
>
> Of particular concern to me is how slow Subversion gets over NFS, and this
> thread grabbed my attention as a result. When using NFS Subversion
> operations can take many times longer (20 seconds -> 20 minutes). I think
> people may be testing and making assumptions that a "local file system" will
> be in use. Do people working on the fs-performance branch check with NFS?
>
> I don't know... just dropping in... feel free to set me straight. :-)

Hi Mark,

You're absolutely right, some Subversion operations perform horribly
with FSFS over NFS (we have such a setup @work). In fact, the poor
performance of e.g. "svn log somefile" on NFS was one of the problems
I was first interested in when looking at svn (and one of the reasons
I got involved with svn development, a positive side-effect :-)).

On our setup at work, "svn log" is about 10 times slower when done
over NFS than on local disk. As I described in this thread (but also
some threads before), "svn log somefile" opens and closes each rev
file about 20 times (and the situation is not better with a packed
repository, because the packed file is opened/closed just as many
times), and it seems that is very expensive when working over NFS.

I haven't been able to test the performance branch (with the file
handle caching) on our NFS setup at work. I have only measured the
number of fopen() calls for an "svn log" operation, compared to trunk,
assuming that is *the* most critical performance differentiator for
NFS setups.

If someone could do some real measurements/benchmarks of "svn log"
(and other operations of course) of the performance branch on an NFS
setup, compared with trunk (and maybe also compare them with a similar
setup with FSFS on local disk), that could be very interesting...
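
As a starting point for such a comparison, the raw cost of an open/close pair can be measured on its own. The following Python sketch is only a micro-benchmark and does not replace timing "svn log" itself. The function name mean_open_close_seconds and the repeat count are arbitrary choices.

```python
import time

def mean_open_close_seconds(path, repeats=1000):
    """Return the mean wall-clock cost of one open()/close() pair on `path`."""
    start = time.perf_counter()
    for _ in range(repeats):
        with open(path, "rb"):
            pass  # open and close only; no data is read
    return (time.perf_counter() - start) / repeats
```

Running it once against a rev file on local disk and once against the same file on the NFS mount, then multiplying by the roughly 20 opens per rev file mentioned above, would give a first estimate of how much of the slowdown the opens alone explain.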

> That said, I'm also (in principle) against implementing cache of open file
> handles. I prefer architectures that cache intermediate data in a processed
> form that the application has made a determined choice to make use of such
> that the cache is the most useful to the application, rather than a
> transparent caching layer that guesses at what is safe. The OS file system
> layer is exactly this - any caching it does is transparent to the
> application and a guess. Guesses are dangerous, which is exactly why the OS
> file system layer cannot do as much caching unless it has 100% control of
> the file system (= local file system).

I agree that it would be best if the architecture was so that svn
could organize its work for most use cases in a way that's efficient
for the lower levels of the system. For instance, for "svn log", svn
should in theory be able to do its work with exactly 1 open/close per
rev file (or in a packed repository, maybe even only 1 open/close per
packed file).

But right now, this isn't the case, and I think it would be a huge
amount of work, change in architecture, layering, ... Until that
happens, I think such a generic file-handle caching layer could prove
very helpful :-). Note though that, if I understood correctly, the
file-handle caching of the performance branch will not be reintegrated
into 1.7, but maybe 1.8 ...

But maybe stefan2 can comment more on that :-).

Cheers,
-- 
Johan

Re: FSFS format 6

Posted by Mark Mielke <ma...@mark.mielke.cc>.
On 02/20/2011 03:50 AM, Ivan Zhakov wrote:
> On Wed, Dec 29, 2010 at 22:37, Stefan Fuhrmann<eq...@web.de>  wrote:
>> The fopen() calls should be eliminated by the
>> file handle cache. IOW, they should already be
>> addressed on the performance branch. Please
>> let me know if that is not the case.
> My belief is that a file handle cache should be implemented at the OS level,
> and I'm pretty sure that it is. And the right way to eliminate the
> duplicate fopen()/read() calls is to improve our FS API.
>
> I haven't reviewed how the file handle cache is implemented in the
> fs-performance branch, but I'm close to -1 on implementing a cache
> of open file handles in Subversion.

What OS implements file handle caching? The OS file system layer for 
most operating systems does implement caching - but open()/close() can 
easily invalidate some or all of this cache due to required POSIX 
behaviour, especially if the backend storage is remote and shared 
between multiple clients such as would be the case over NFS. This is 
required to implement consistency across clients. The local operating 
system cannot arbitrarily cache everything, and every bit of data it 
does decide to cache could be wrong at any point in time without other 
aspects in use such as file locking.

Of particular concern to me is how slow Subversion gets over NFS, and 
this thread grabbed my attention as a result. When using NFS Subversion 
operations can take many times longer (20 seconds -> 20 minutes). I 
think people may be testing and making assumptions that a "local file 
system" will be in use. Do people working on the fs-performance branch 
check with NFS?

I don't know... just dropping in... feel free to set me straight. :-)

That said, I'm also (in principle) against implementing cache of open 
file handles. I prefer architectures that cache intermediate data in a 
processed form that the application has made a determined choice to make 
use of such that the cache is the most useful to the application, rather 
than a transparent caching layer that guesses at what is safe. The OS 
file system layer is exactly this - any caching it does is transparent 
to the application and a guess. Guesses are dangerous, which is exactly 
why the OS file system layer cannot do as much caching unless it has 
100% control of the file system (= local file system).

Cheers,
mark

-- 
Mark Mielke<ma...@mielke.cc>