You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by "Vyacheslav V. Zholudev" <vy...@gmail.com> on 2008/06/10 11:42:49 UTC

skels in string table

Hello!

I'm digging into BDB backend and thought that strings "table" can 
contain only either full text or deltas, but I found out that smth like:

 "((ins.xml 5 6.0.7) (install_log.xml 5 4.0.5) (cp.xml 5 7.0.8))"

is written to the strings table as a fulltext, which reminds me some 
skel. Or is it smth else? If yes, how can i distinguish whether the full 
content of file is written or smth else?
Thanks in advance!

Best,
Vyacheslav


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: skels in string table

Posted by Blair Zajac <bl...@orcaware.com>.
Mark Phippard wrote:
> On Wed, Jun 11, 2008 at 5:57 PM, Blair Zajac <bl...@orcaware.com> wrote:
>> Yes, if you have 20,000 entries in a single directory, then in an fsfs
>> backend, a single modification to one of these entries or a child ends up
>> creating a 700 kByte revision.  Ouch!
>>
>> In our svn backend, we're now md5 hashing entries into a one directory deep
>> hash buckets, using 30 buckets per directory.
> 
> This is a custom Subversion filesystem?

No, just a normal Subversion filesystem.  I wrote an RPC layer using Ice to 
expose svn_fs.h over the network.  The RPC layer splits input paths on /'s, for 
each path element prepends one new directory with a name based on the md5 of the 
path element and then joins all of them back together before using any functions 
in svn_fs.h.

Blair


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: skels in string table

Posted by Mark Phippard <ma...@gmail.com>.
On Wed, Jun 11, 2008 at 5:57 PM, Blair Zajac <bl...@orcaware.com> wrote:
> Yes, if you have 20,000 entries in a single directory, then in an fsfs
> backend, a single modification to one of these entries or a child ends up
> creating a 700 kByte revision.  Ouch!
>
> In our svn backend, we're now md5 hashing entries into a one directory deep
> hash buckets, using 30 buckets per directory.

This is a custom Subversion filesystem?

-- 
Thanks

Mark Phippard
http://markphip.blogspot.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: skels in string table

Posted by Blair Zajac <bl...@orcaware.com>.
Ben Collins-Sussman wrote:
> On Wed, Jun 11, 2008 at 4:45 PM, Karl Fogel <kf...@red-bean.com> wrote:
>> "Vyacheslav V. Zholudev" <vy...@gmail.com> writes:
>>> Hello!
>>>
>>> I'm digging into BDB backend and thought that strings "table" can
>>> contain only either full text or deltas, but I found out that smth
>>> like:
>>>
>>> "((ins.xml 5 6.0.7) (install_log.xml 5 4.0.5) (cp.xml 5 7.0.8))"
>>>
>>> is written to the strings table as a fulltext, which reminds me some
>>> skel. Or is it smth else? If yes, how can i distinguish whether the
>>> full content of file is written or smth else?
>>> Thanks in advance!
>> Vyacheslav,
>>
>> Since directories are just lists of entries anyway, we represent them
>> using the same skel syntax used for reps metadata -- so that's how
>> skel-like data ends up in the strings table.
>>
>> Each subcell above is a directory name followed by the name of its node
>> revision (the name is a string, so the length comes first, followed by a
>> space; see subversion/libsvn_fs_base/util/skel.h for details, but you've
>> probably already read that file :-) ).
> 
> Alas, this design doesn't scale well when you have versioned
> directories with thousands of children.  Every attempt to add, delete,
> or modify a child in the directory causes the *entire* list of dirents
> to be loaded into memory, then serialized back to disk again.  As the
> number of children in the directory increases, it becomes O(N^2) to to
> modify them.
> 
> I'm dreaming of fixing this someday... perhaps by splitting up a
> directory's dirents across multiple string-keys.

Yes, if you have 20,000 entries in a single directory, then in an fsfs backend, 
a single modification to one of these entries or a child ends up creating a 700 
kByte revision.  Ouch!

In our svn backend, we're now md5 hashing entries into a one directory deep hash 
buckets, using 30 buckets per directory.

Introducing this hashing shrunk one repository from 102.12 to 10.96 Gbytes.

Blair


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: skels in string table

Posted by Karl Fogel <kf...@red-bean.com>.
"Ben Collins-Sussman" <su...@red-bean.com> writes:
> Alas, this design doesn't scale well when you have versioned
> directories with thousands of children.  Every attempt to add, delete,
> or modify a child in the directory causes the *entire* list of dirents
> to be loaded into memory, then serialized back to disk again.  As the
> number of children in the directory increases, it becomes O(N^2) to to
> modify them.

Oh, yeah -- I neglected to mention that we all agree this sucks :-).

> I'm dreaming of fixing this someday... perhaps by splitting up a
> directory's dirents across multiple string-keys.

+1

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: skels in string table

Posted by Greg Hudson <gh...@MIT.EDU>.
On Wed, 2008-06-11 at 20:27 -0500, Ben Collins-Sussman wrote:
> I haven't thought about it hard yet;  we just need to 'split up' a
> directory's entries into multiple places somehow.  Perhaps via hash
> buckets, as Blair has described.  The ultimate feature we want is that
> it should be *fast* to locate a single entry, and *fast* to append a
> new entry to a directory, regardless of how many children it has.

I think the ideal structure here is some kind of multi-headed btree.
Ignore the current structure of BDB and FSFS for a moment, and imagine
that all information about all revisions of a directory are located in
one file.  The file would end with a binary table mapping revisions to
block offsets, so you'd do some appropriate seeking math to look that up
and then you'd load a block of sorted directory entries.  A block could
indirect to other blocks between entries, so it might be able to say:

  abacus <rep key>
  <indirection to other block>
  marsupial <rep key>
  <indirection to other block>
  zebra <rep key>

Since you're going to load 4K blocks at a time, you'd expect it to
contain a whole bunch of entries and indirections.  Only for very large
directories would you need to indirect more than once, I think.

A change to a single entry would only result in having to write out a
few new blocks, typically; you would truncate off the revision index,
write out the new blocks, and then write out a new revision index.  If
the tree becomes too imbalanced, you have to write out some extra blocks
so that the new revision has a reasonably balanced tree, but according
to the btree theory I'm familiar with, that's still only O(lg(n)) extra
blocks to write out, or in practice, only a couple.



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: skels in string table

Posted by Ben Collins-Sussman <su...@red-bean.com>.
I haven't thought about it hard yet;  we just need to 'split up' a
directory's entries into multiple places somehow.  Perhaps via hash
buckets, as Blair has described.  The ultimate feature we want is that
it should be *fast* to locate a single entry, and *fast* to append a
new entry to a directory, regardless of how many children it has.  I'm
sure wikipedia has endless textbook algorithms on efficient filesystem
algorithms for this sort of thing.  We simply took the naive approach
when first wrote libsvn_fs_base.


On Wed, Jun 11, 2008 at 5:31 PM, Vyacheslav V. Zholudev
<vy...@gmail.com> wrote:
> Ben,
>
> maybe I got you wrong, but would it be efficient if you split directory
> entries across multiple strings with the same key? Will BDB work fast if
> there are thousand of entries with the same key? How are you gonna iterate
> them in a right way? Or this question hasn't been elaborated yet?
>
> Ben Collins-Sussman wrote:
>
> On Wed, Jun 11, 2008 at 4:45 PM, Karl Fogel <kf...@red-bean.com> wrote:
>
>
> "Vyacheslav V. Zholudev" <vy...@gmail.com> writes:
>
>
> Hello!
>
> I'm digging into BDB backend and thought that strings "table" can
> contain only either full text or deltas, but I found out that smth
> like:
>
> "((ins.xml 5 6.0.7) (install_log.xml 5 4.0.5) (cp.xml 5 7.0.8))"
>
> is written to the strings table as a fulltext, which reminds me some
> skel. Or is it smth else? If yes, how can i distinguish whether the
> full content of file is written or smth else?
> Thanks in advance!
>
>
> Vyacheslav,
>
> Since directories are just lists of entries anyway, we represent them
> using the same skel syntax used for reps metadata -- so that's how
> skel-like data ends up in the strings table.
>
> Each subcell above is a directory name followed by the name of its node
> revision (the name is a string, so the length comes first, followed by a
> space; see subversion/libsvn_fs_base/util/skel.h for details, but you've
> probably already read that file :-) ).
>
>
> Alas, this design doesn't scale well when you have versioned
> directories with thousands of children.  Every attempt to add, delete,
> or modify a child in the directory causes the *entire* list of dirents
> to be loaded into memory, then serialized back to disk again.  As the
> number of children in the directory increases, it becomes O(N^2) to to
> modify them.
>
> I'm dreaming of fixing this someday... perhaps by splitting up a
> directory's dirents across multiple string-keys.
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: skels in string table

Posted by "Vyacheslav V. Zholudev" <vy...@gmail.com>.
Ben,

maybe I got you wrong, but would it be efficient if you split directory 
entries across multiple strings with the same key? Will BDB work fast if 
there are thousand of entries with the same key? How are you gonna 
iterate them in a right way? Or this question hasn't been elaborated yet?

Ben Collins-Sussman wrote:
> On Wed, Jun 11, 2008 at 4:45 PM, Karl Fogel <kf...@red-bean.com> wrote:
>   
>> "Vyacheslav V. Zholudev" <vy...@gmail.com> writes:
>>     
>>> Hello!
>>>
>>> I'm digging into BDB backend and thought that strings "table" can
>>> contain only either full text or deltas, but I found out that smth
>>> like:
>>>
>>> "((ins.xml 5 6.0.7) (install_log.xml 5 4.0.5) (cp.xml 5 7.0.8))"
>>>
>>> is written to the strings table as a fulltext, which reminds me some
>>> skel. Or is it smth else? If yes, how can i distinguish whether the
>>> full content of file is written or smth else?
>>> Thanks in advance!
>>>       
>> Vyacheslav,
>>
>> Since directories are just lists of entries anyway, we represent them
>> using the same skel syntax used for reps metadata -- so that's how
>> skel-like data ends up in the strings table.
>>
>> Each subcell above is a directory name followed by the name of its node
>> revision (the name is a string, so the length comes first, followed by a
>> space; see subversion/libsvn_fs_base/util/skel.h for details, but you've
>> probably already read that file :-) ).
>>     
>
> Alas, this design doesn't scale well when you have versioned
> directories with thousands of children.  Every attempt to add, delete,
> or modify a child in the directory causes the *entire* list of dirents
> to be loaded into memory, then serialized back to disk again.  As the
> number of children in the directory increases, it becomes O(N^2) to to
> modify them.
>
> I'm dreaming of fixing this someday... perhaps by splitting up a
> directory's dirents across multiple string-keys.
>
>   


Re: skels in string table

Posted by Ben Collins-Sussman <su...@red-bean.com>.
On Wed, Jun 11, 2008 at 4:45 PM, Karl Fogel <kf...@red-bean.com> wrote:
> "Vyacheslav V. Zholudev" <vy...@gmail.com> writes:
>> Hello!
>>
>> I'm digging into BDB backend and thought that strings "table" can
>> contain only either full text or deltas, but I found out that smth
>> like:
>>
>> "((ins.xml 5 6.0.7) (install_log.xml 5 4.0.5) (cp.xml 5 7.0.8))"
>>
>> is written to the strings table as a fulltext, which reminds me some
>> skel. Or is it smth else? If yes, how can i distinguish whether the
>> full content of file is written or smth else?
>> Thanks in advance!
>
> Vyacheslav,
>
> Since directories are just lists of entries anyway, we represent them
> using the same skel syntax used for reps metadata -- so that's how
> skel-like data ends up in the strings table.
>
> Each subcell above is a directory name followed by the name of its node
> revision (the name is a string, so the length comes first, followed by a
> space; see subversion/libsvn_fs_base/util/skel.h for details, but you've
> probably already read that file :-) ).

Alas, this design doesn't scale well when you have versioned
directories with thousands of children.  Every attempt to add, delete,
or modify a child in the directory causes the *entire* list of dirents
to be loaded into memory, then serialized back to disk again.  As the
number of children in the directory increases, it becomes O(N^2) to to
modify them.

I'm dreaming of fixing this someday... perhaps by splitting up a
directory's dirents across multiple string-keys.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: skels in string table

Posted by Karl Fogel <kf...@red-bean.com>.
"Vyacheslav V. Zholudev" <vy...@gmail.com> writes:
> Hello!
>
> I'm digging into BDB backend and thought that strings "table" can
> contain only either full text or deltas, but I found out that smth
> like:
>
> "((ins.xml 5 6.0.7) (install_log.xml 5 4.0.5) (cp.xml 5 7.0.8))"
>
> is written to the strings table as a fulltext, which reminds me some
> skel. Or is it smth else? If yes, how can i distinguish whether the
> full content of file is written or smth else?
> Thanks in advance!

Vyacheslav,

Since directories are just lists of entries anyway, we represent them
using the same skel syntax used for reps metadata -- so that's how
skel-like data ends up in the strings table.  

Each subcell above is a directory name followed by the name of its node
revision (the name is a string, so the length comes first, followed by a
space; see subversion/libsvn_fs_base/util/skel.h for details, but you've
probably already read that file :-) ).

-Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: skels in string table

Posted by Branko Čibej <br...@xbc.nu>.
Vyacheslav V. Zholudev wrote:
> Hello!
>
> I'm digging into BDB backend and thought that strings "table" can 
> contain only either full text or deltas, but I found out that smth like:
>
> "((ins.xml 5 6.0.7) (install_log.xml 5 4.0.5) (cp.xml 5 7.0.8))"
>
> is written to the strings table as a fulltext, which reminds me some 
> skel. Or is it smth else? If yes, how can i distinguish whether the 
> full content of file is written or smth else?

This is a fulltext of a directory, not a file. :)

-- Brane


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org