Posted to dev@subversion.apache.org by Paul Holden <pa...@gmail.com> on 2010/04/09 13:31:30 UTC

Severe performance issues with large directories

Hello,



I’ve had a look through the issue tracker and mailing list archives and
didn’t find any references to this issue. I also assume that this is a more
appropriate mailing list than 'users'.



We’ve noticed recently that we have terrible performance when updating a
particular directory in our repository. We’ve realised that the poor
performance is related to the fact that we have 5,800 or so files in a
single directory. (I know! This is far from ideal but we’re a long way into
development and reorganising the directory structure at this stage is very
difficult.)



To give some concrete numbers, we recently re-exported about 10,000 texture
files (averaging about 40KB each, or 390MB) and 5,800 shaders (averaging
about 4KB each, or 22MB total). Both these files are gzip compressed. Here
are some approximate times for ‘svn up’



Textures: 10,000 files, 390MB, ~4 minutes

Shaders: 5,800 files, 22MB, ~10 minutes



The key point here is that the textures are nicely distributed in a
well-organised directory structure, but the shaders are dumped into a single
directory.



The problem we face now is that we're iterating a lot on the engine, which
is causing us to rebuild the shaders every day.



To cut a long story short, I ran SysInternals procmon.exe while svn was
updating, and saw two alarming behaviours:



1) .svn\entries is being read in its entirety (in 4kb chunks) for *every*
file that’s updated in the directory. As the shaders dir contains so many
files, it’s approximately 1MB in size. That’s 5,800 reads of a 1MB file
(5.8GB in total) for a single update! I know this file is likely to be
cached by the OS, but that’s still a lot of unnecessary system calls and
memory being copied around. Please excuse my ignorance if there's a
compelling reason to re-read this file multiple times, but can't subversion
cache the contents of this file when it's updating the directory? Presumably
it's locked the directory at this point, so it can be confident that the
contents of this file won't be changed externally?
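As a rough illustration of why this matters (a hypothetical Python sketch, not
Subversion's actual code), compare re-reading the per-directory metadata for
every file with reading it once and reusing the cached copy while the
directory is locked:

    # Hypothetical sketch only: 'entries' here is a stand-in file, and these
    # functions are not how Subversion is implemented.
    import os, tempfile

    def read_entries(path):
        with open(path, "rb") as f:            # stand-in for parsing .svn/entries
            return f.read()

    def update_naive(entries_path, files):
        total = 0
        for _ in files:                        # re-read ~1MB for every file
            total += len(read_entries(entries_path))
        return total

    def update_cached(entries_path, files):
        entries = read_entries(entries_path)   # read once per directory
        return len(entries)

    tmpdir = tempfile.mkdtemp()
    entries = os.path.join(tmpdir, "entries")
    with open(entries, "wb") as f:
        f.write(b"x" * (1 << 20))              # ~1MB, like a 5,800-entry directory
    files = ["shader%d.fx" % i for i in range(5800)]
    print(update_naive(entries, files))        # ~5.8GB of reads
    print(update_cached(entries, files))       # ~1MB

The naive loop performs roughly 5,800 times the I/O and system calls to obtain
exactly the same information.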



2) subversion appears to generate a temporary file in .svn\prop-base\ for
every file that's being updated. It's generating filenames sequentially,
which means that when 5,800 files are being updated it ends up doing this:



file_open tempfile.tmp? Already exists!

file_open tempfile.2.tmp? Already exists!

file_open tempfile.3.tmp? Already exists!

...some time later

file_open tempfile.5800.tmp? Yes!



For N files in a directory, that means subversion ends up doing (N^2 + N)/2
calls to file_open. In our case that means it's testing for file existence
16,822,900 times (!) in order to do a full update. Even with just 100 files
in a directory that's 5,050 tests.
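To make the arithmetic concrete (purely illustrative Python, reproducing the
numbers above):

    def probes(n):
        # the k-th new file probes k names, so N files cost 1 + 2 + ... + N
        return (n * n + n) // 2

    print(probes(5800))   # 16822900 existence checks
    print(probes(100))    # 5050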



Is there any inherent reason these files need to be generated sequentially?
From reading the comments in 'svn_io_open_uniquely_named' it sounds like
these files are named sequentially for the benefit of people looking at
conflicts in their working directory. As these files are being generated
within the 'magic' .svn folder, is there any reason to number them
sequentially? Just calling rand() until there were no collisions would
probably give a huge increase in performance.
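For what it's worth, here is a minimal sketch of the randomised naming being
suggested; this is an assumption about how it could be done, not what
svn_io_open_uniquely_named actually does. O_CREAT|O_EXCL makes the existence
check and creation atomic, so the expected number of attempts stays close to
one however many temp files already exist:

    import os, random

    def open_unique(dirpath, prefix="tempfile", suffix=".tmp"):
        while True:
            name = "%s.%08x%s" % (prefix, random.getrandbits(32), suffix)
            path = os.path.join(dirpath, name)
            try:
                # fails with FileExistsError if the name is already taken
                fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
                return fd, path
            except FileExistsError:
                continue          # rare collision: just pick another name

(This is essentially what mkstemp-style helpers do; the point is only that the
cost no longer grows with the number of files already present.)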





I appreciate that we're probably an edge case with ~6000 files, but it seems
that issue 2) is a relatively straightforward change which would yield clear
benefits even for more sane repositories (and across all platforms too).



In case it's relevant, I'm using the CollabNet build of subversion on
Windows 7 64bit. Here's 'svn --version':

C:\dev\CW_br2>svn --version

svn, version 1.6.6 (r40053)

   compiled Oct 19 2009, 09:36:48



Thanks,

Paul

Re: Severe performance issues with large directories

Posted by Paul Holden <pa...@gmail.com>.
On 9 Apr 2010 20:59, "Greg Stein" <gs...@gmail.com> wrote:

> Not a new API. Just a revamp of svn_io_open_unique_file3. Compare the
> 1.6.x version against trunk.

> Paul: if you're adventurous, and build your own svn, you could try
> lifting the contents of trunk's svn_io_open_unique_file3, and dropping
> that into your 1.6.x.

Thanks Greg. One of my colleagues is compiling from source, so he's
going to take a look at this. I'll let you know how we get on.

Cheers,
Paul

Re: Severe performance issues with large directories

Posted by Mark Phippard <ma...@gmail.com>.
On Fri, Apr 9, 2010 at 3:59 PM, Greg Stein <gs...@gmail.com> wrote:
>> Are you sure that this is in prop-base, not .svn/tmp?
>
> Whatever. Does it matter? :-P
>
> I think we should backport the changes we made in svn_io_open_unique_file3.
>
>> For 1.7 we made the tempfilename generator better at guessing new names, but
>> for property handling we won't be using files in 1.7. (Looking at these
>> numbers and those that follow later in your mail, we might have to look into
>> porting some of this back to 1.6.)
>
> Yah.

I recall we had a similar problem on large commits when generating
temp files.  It was fixed in 1.6.3:

http://svn.apache.org/viewvc?view=revision&revision=878046

It looks like the solution was to avoid using
svn_io_open_unique_file3.  Not sure if a similar workaround could be
used here.

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

Re: Severe performance issues with large directories

Posted by Greg Stein <gs...@gmail.com>.
On Fri, Apr 9, 2010 at 08:27, Bert Huijben <be...@qqmail.nl> wrote:
>> -----Original Message-----
>> From: Paul Holden [mailto:paul.holden@gmail.com]
>...
>> 2) subversion appears to generate a temporary file in .svn\prop-base\ for
>> every file that's being updated. It's generating filenames sequentially,
>> which means that when 5,800 files are being updated it ends up doing this:
>>
>>
>> file_open tempfile.tmp? Already exists!
>>
>> file_open tempfile.2.tmp? Already exists!
>>
>> file_open tempfile.3.tmp? Already exists!
>>
>> ...some time later
>>
>> file_open tempfile.5800.tmp? Yes!
>
> Wow.

Yeah, Wow. And I'll raise you an Ugh.

> Are you sure that this is in prop-base, not .svn/tmp?

Whatever. Does it matter? :-P

I think we should backport the changes we made in svn_io_open_unique_file3.

> For 1.7 we made the tempfilename generator better at guessing new names, but
> for property handling we won't be using files in 1.7. (Looking at these
> numbers and those that follow later in your mail, we might have to look into
> porting some of this back to 1.6.)

Yah.

>...
>> Is there any inherent reason these files need to be generated
>> sequentially?
>> From reading the comments in 'svn_io_open_uniquely_named' it sounds like
>> these files are named sequentially for the benefit of people looking at
>> conflicts in their working directory. As these files are being generated
>> within the 'magic' .svn folder, is there any reason to number them
>> sequentially? Just calling rand() until there were no collisions would
>> probably give a huge increase in performance.
>
> In 1.7 we have a new api that uses a smarter algorithm, but we can't add
> public apis to 1.6 now.

Not a new API. Just a revamp of svn_io_open_unique_file3. Compare the
1.6.x version against trunk.

Paul: if you're adventurous, and build your own svn, you could try
lifting the contents of trunk's svn_io_open_unique_file3, and dropping
that into your 1.6.x.

>...

Cheers,
-g

RE: Severe performance issues with large directories

Posted by Geoff Rowell <ge...@gmail.com>.
> -----Original Message-----
> From: Bert Huijben [mailto:bert@qqmail.nl] 
> Sent: Friday, April 09, 2010 8:27 AM
> To: 'Paul Holden'; dev@subversion.apache.org
> Subject: RE: Severe performance issues with large directories
> 
> This issue is actually worse on Windows than on Linux, because NTFS is a
> fully transactional filesystem with more advanced locking. Because of this
> it needs to do more work to open a file. (Some tests I performed 1.5 years
> ago indicated that NTFS is more than 100 times slower at handling extremely
> small files than the EXT3 filesystem on Linux, while throughput within a
> single file is not far apart.)

I've had to go to some lengths to deal with poor performance of large
working copy folders under Windows.

I've got working copies containing media snippets that are shared between
Solaris and Windows systems. Since the layout was originally designed for
the Solaris systems, it had no problem with many thousands of files in a
directory. Under Windows, a checkout of one of these folders will start out
zipping along, but will eventually slow down to a crawl. Left to itself,
it'll take several days to complete. (Under Linux, this takes a couple
hours.)

And before someone tries to claim it - this has nothing to do with virus
checking. I don't install virus checking until after I've set up the working
copies.

In order to get a new Windows working copy, or to apply a major addition,
we've taken to doing an update/checkout on a Linux system, archiving the
working copy and extracting it onto the Windows system. Obviously, not our
preferred method.

Any performance testing for Subversion should include testing under Windows.
Theological discussions aside, it's an important market segment.
---
Geoff Rowell
geoff.rowell@gmail.com



Re: Severe performance issues with large directories

Posted by Paul Holden <pa...@gmail.com>.
Hi Bert,

Many thanks for the quick response.

> I think you can find a lot of issues similar to your issue in our issue
> tracker.

Search fail on my part - sorry for re-treading old ground.

> For WC-NG we move all the entries data into a single wc.db file in a .svn
> directory below the root of your working copy. This database is accessed via
> SQLite, so it doesn't need the chunked rewriting or anything like that. (It
> even has in-memory caching and transaction handling, so we don't have to do
> that in Subversion itself any more.)

Sounds great. We have quite a deep, dense directory structure and so a
full update (or any walk over the whole working copy) involves
accessing hundreds of subdirectories. Merging is particularly
painful. I imagine this could help a great deal.

> > 2) subversion appears to generate a temporary file in .svn\prop-base\ for
> > every file that's being updated. It's generating filenames sequentially,
> > which means that when 5,800 files are being updated it ends up doing this:
> >
> > file_open tempfile.tmp? Already exists!
> > file_open tempfile.2.tmp? Already exists!
> > file_open tempfile.3.tmp? Already exists!
> > ...some time later
> > file_open tempfile.5800.tmp? Yes!
>
> Wow.
>
> Are you sure that this is in prop-base, not .svn/tmp?

Yes, definitely. Each of these files has an svn:mime-type property of
'application/octet-stream', so I guess it's that (the property isn't
changing between updates, however).

> For 1.7 we made the tempfilename generator better at guessing new names, but
> for property handling we won't be using files in 1.7. (Looking at these
> numbers and those that follow later in your mail, we might have to look into
> porting some of this back to 1.6.)

I'd love to see this in 1.6, as it's biting us quite hard right now -
to the extent that we're seriously discussing moving this stuff out of
version control (which is terrifying). I'm sure we'll switch over to
1.7 as soon as we can however.

> Properties will be moved into wc.db, to remove the file accesses completely.
> (We can update them with the node information in a single transaction,
> without additional file accesses.)

Again, sounds great :)

> > Is there any inherent reason these files need to be generated
> sequentially?
> > From reading the comments in 'svn_io_open_uniquely_named' it sounds like
> > these files are named sequentially for the benefit of people looking at
> > conflicts in their working directory. As these files are being generated
> > within the 'magic' .svn folder, is there any reason to number them
> > sequentially? Just calling rand() until there were no collisions would
> > probably give a huge increase in performance.
>
> In 1.7 we have a new api that uses a smarter algorithm, but we can't add
> public apis to 1.6 now.

It's a shame that the api would need to change to support this. I
suppose checking to see if the tempfile was being generated under
'.svn/prop-base' and using an alternative strategy is too gross? (I'm
half joking)

> > In case it's relevant, I'm using the CollabNet build of subversion on
> > Windows 7 64bit. Here's 'svn --version':
> >
> > C:\dev\CW_br2>svn --version
>
> This issue is actually worse on Windows than on Linux, because NTFS is a
> fully transactional filesystem with more advanced locking. Because of this
> it needs to do more work to open a file. (Some tests I performed 1.5 years
> ago indicated that NTFS is more than 100 times slower at handling extremely
> small files than the EXT3 filesystem on Linux, while throughput within a
> single file is not far apart.)

Yeah - we're seeing the same issue on some of our Linux boxes. The
problem is still there, but it's not as severe.

Many thanks,
Paul

RE: Severe performance issues with large directories

Posted by Bert Huijben <be...@qqmail.nl>.

> -----Original Message-----
> From: Paul Holden [mailto:paul.holden@gmail.com]
> Sent: vrijdag 9 april 2010 13:32
> To: dev@subversion.apache.org
> Subject: Severe performance issues with large directories
> 
> Hello,
> 
> 
> 
> I've had a look through the issue tracker and mailing list archives and
> didn't find any references to this issue. I also assume that this is a more
> appropriate mailing list than 'users'.

I think you can find a lot of issues similar to yours in our issue
tracker.

For Subversion 1.7 we are rewriting the entire working copy library to use a
database, which should resolve most of your issues (and which will allow us
to resolve more issues in future versions). The issues related to this
rewrite have the 'WC-NG' name somewhere.

> We've noticed recently that we have terrible performance when updating a
> particular directory in our repository. We've realised that the poor
> performance is related to the fact that we have 5,800 or so files in a
> single directory. (I know! This is far from ideal but we're a long way into
> development and reorganising the directory structure at this stage is very
> difficult.)
> 
That is certainly a number of files where the entries handling in the
current wc library will be slow. (Confirming your findings later in your
mail.)
 
> To give some concrete numbers, we recently re-exported about 10,000 texture
> files (averaging about 40KB each, or 390MB) and 5,800 shaders (averaging
> about 4KB each, or 22MB total). Both these files are gzip compressed. Here
> are some approximate times for 'svn up'
> 
> 
> 
> Textures: 10,000 files, 390MB, ~4 minutes
> 
> Shaders: 5,800 files, 22MB, ~10 minutes
> 
> 
> 
> The key point here is that the textures are nicely distributed in a
> well-organised directory structure, but the shaders are dumped into a single
> directory.
> 
> 
> 
> The problem we face now is that we're iterating a lot on the engine, which
> is causing us to rebuild the shaders every day.
> 
> 
> 
> To cut a long story short, I ran SysInternals procmon.exe while svn was
> updating, and saw two alarming behaviours:
> 
> 
> 
> 1) .svn\entries is being read in its entirety (in 4kb chunks) for *every*
> file that's updated in the directory. As the shaders dir contains so many
> files, it's approximately 1MB in size. That's 5,800 reads of a 1MB file
> (5.8GB in total) for a single update! I know this file is likely to be
> cached by the OS, but that's still a lot of unnecessary system calls and
> memory being copied around. Please excuse my ignorance if there's a
> compelling reason to re-read this file multiple times, but can't subversion
> cache the contents of this file when it's updating the directory? Presumably
> it's locked the directory at this point, so it can be confident that the
> contents of this file won't be changed externally?

For WC-NG we move all the entries data into a single wc.db file in a .svn
directory below the root of your working copy. This database is accessed via
SQLite, so it doesn't need the chunked rewriting or anything like that. (It
even has in-memory caching and transaction handling, so we don't have to do
that in Subversion itself any more.)
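A rough sketch of why that helps (the table and columns below are invented
for illustration and are not the real wc.db schema): one SQLite file can hold
the metadata for every node and be updated for a whole directory in a single
transaction, instead of rewriting a per-directory entries file once per node:

    import sqlite3

    conn = sqlite3.connect("wc.db")      # example file, not a real working copy
    conn.execute("""CREATE TABLE IF NOT EXISTS nodes (
                        local_relpath TEXT PRIMARY KEY,
                        revision      INTEGER,
                        properties    BLOB)""")

    def update_nodes(conn, rows):
        with conn:                       # one transaction for the whole batch
            conn.executemany(
                "INSERT OR REPLACE INTO nodes VALUES (?, ?, ?)", rows)

    update_nodes(conn, [("shaders/shader%d.fx" % i, 42, b"") for i in range(5800)])
    print(conn.execute("SELECT COUNT(*) FROM nodes").fetchone()[0])   # 5800
    conn.close()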

> 2) subversion appears to generate a temporary file in .svn\prop-base\ for
> every file that's being updated. It's generating filenames sequentially,
> which means that when 5,800 files are being updated it ends up doing this:
> 
> 
> 
> file_open tempfile.tmp? Already exists!
> 
> file_open tempfile.2.tmp? Already exists!
> 
> file_open tempfile.3.tmp? Already exists!
> 
> ...some time later
> 
> file_open tempfile.5800.tmp? Yes!

Wow.

Are you sure that this is in prop-base, not .svn/tmp?

For 1.7 we made the tempfilename generator better at guessing new names, but
for property handling we won't be using files in 1.7. (Looking at these
numbers and those that follow later in your mail, we might have to look into
porting some of this back to 1.6.)

Properties will be moved into wc.db, to remove the file accesses completely.
(We can update them with the node information in a single transaction,
without additional file accesses.)

> For N files in a directory, that means subversion ends up doing (N^2 + N)/2
> calls to file_open. In our case that means it's testing for file existence
> 16,822,900 times (!) in order to do a full update. Even with just 100 files
> in a directory that's 5,050 tests.
> 
> 
> 
> Is there any inherent reason these files need to be generated sequentially?
> From reading the comments in 'svn_io_open_uniquely_named' it sounds like
> these files are named sequentially for the benefit of people looking at
> conflicts in their working directory. As these files are being generated
> within the 'magic' .svn folder, is there any reason to number them
> sequentially? Just calling rand() until there were no collisions would
> probably give a huge increase in performance.

In 1.7 we have a new api that uses a smarter algorithm, but we can't add
public apis to 1.6 now.

> I appreciate that we're probably an edge case with ~6000 files, but it seems
> that issue 2) is a relatively straightforward change which would yield clear
> benefits even for more sane repositories (and across all platforms too).
> 
> 
> 
> In case it's relevant, I'm using the CollabNet build of subversion on
> Windows 7 64bit. Here's 'svn --version':
> 
> C:\dev\CW_br2>svn --version

This issue is actually worse on Windows than on Linux, because NTFS is a
fully transactional filesystem with more advanced locking. Because of this
it needs to do more work to open a file. (Some tests I performed 1.5 years
ago indicated that NTFS is more than 100 times slower at handling extremely
small files than the EXT3 filesystem on Linux, while throughput within a
single file is not far apart.)

	Bert