You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by "Brian W. Fitzpatrick" <fi...@collab.net> on 2004/12/20 22:36:55 UTC

Proposal: Improved locking implementation for fsfs

The current tree-based lock-storage mechanism in fsfs doesn't account
for variations in underlying filesystems's filename encoding support.

Peter Lundblad and I have been discussing several different ways of
reimplementing the fsfs lock storage mechanism, and below is a
proposed implementation (and a variant of it) to address these
shortcomings, based on email, irc, and face-to-face conversations with
lundblad, mbk, ghudson, sussman, and kfogel.  I'm hoping that we can
come to consensus on one of these implementations, and I'll start
writing it after the start of the New Year (unless, of course, someone
beats me to it :).

This proposal sticks with the current mechanism of using the
filesystem as a tree to store the locks themselves, and the
lock-tokens (which merely contain a pointer to the lock) are all
stored in one directory, keyed on their token/uuid.


The Proposal: All The World's A Hash.

Summary: Hash each path component of a lock path and store locks in
the filesystem under the hashed name.

Details:

The path to a lock is broken into its components, and each path
component is hashed using the first 16 characters from the MD5 hash of
the component name.  We'll only use the first 16 characters to avoid
issues with systems whose PATH_MAX is only 256 bytes (16 char
component names + 1 char path separators gives us a maximum of 15 path
components on such systems.  For example:

foo/bar/baz.c becomes

foo -> d3b07384d113edec
bar -> c157a79031e1c40f
baz.c -> 72f8c9bac968893a

Lock storage path: 

repos/db/locks/d3b07384d113edec/c157a79031e1c40f/72f8c9bac968893a

Lockfiles can contain multiple lock representations, so collisions
will not be an issue (think of the end file as a hash bucket).


Variation A: The Hash/ASCII Experiment

Summary: Force each path component to some 127bit ASCII representation
and store locks the filesystem under a lowercase "ASCII-ized" name.
Use a hash for names that can't be efficiently "ASCII-ized".

Details:

The path to a lock is broken into its components, and each path
component is translated into its lowercase ASCII equivalent (Using the
4 character hex code) for non-ASCII characters.  If the resulting
"ASCII-ized" string is both a)longer than 16 characters and b) greater
than 1.5x times the length of the original string, we fall back onto
the 16 byte MD5 hash as described in the main proposal.

Example 1:
  foo/BAR/baz.c becomes

  foo -> foo
  BAR -> bar
  baz.c -> baz.c

  Lock storage path: 

  repos/db/locks/foo/bar/baz.c

Example 2:
  foo/bör/baz.c becomes

  foo -> foo
  bör -> bar
  baz.c -> baz.c

  Lock storage path: 

  repos/db/locks/foo/b00F6r/baz.c

Example 3:
  /foo/bar/böööö.c becomes

  foo -> foo
  bar -> bar
  böööö.c -> b00f600f600f600f6.c, which is longer than 16 chars and
                                  1.5x the length of the original path
                                   component, so the hash is used instead:
             edcb45bd435bfc99

  Lock storage path: 

  repos/db/locks/foo/bar/edcb45bd435bfc99


Thotz? Comments?

-Fitz




---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Proposal: Improved locking implementation for fsfs

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.
On Mon, 20 Dec 2004, Max Bowsher wrote:

> Brian W. Fitzpatrick wrote:
> > Lock storage path:
> >
> > repos/db/locks/d3b07384d113edec/c157a79031e1c40f/72f8c9bac968893a
> >
> > Lockfiles can contain multiple lock representations, so collisions
> > will not be an issue (think of the end file as a hash bucket).
>
> Could problems arise if a commit turns a locked file into a directory of the
> same name?
>
I don't think so. The lock file will be deleted when the file is deleted
(I assume). The commit can't lock any files below the new directory. I
don't see a problem the other way round either. When a directory is
deleted, the subtree will also be deleted during commit finalization.

Hmm, but what about locking a non-existent path? That's not a problem in
our own client, but say a DAV client locks foo/bar/dir/file, where foo/bar
is a file already. I assume it has to delete foo/bar first.

Thanks,
//Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Proposal: Improved locking implementation for fsfs

Posted by Max Bowsher <ma...@ukf.net>.
Brian W. Fitzpatrick wrote:
> Lock storage path:
>
> repos/db/locks/d3b07384d113edec/c157a79031e1c40f/72f8c9bac968893a
>
> Lockfiles can contain multiple lock representations, so collisions
> will not be an issue (think of the end file as a hash bucket).

Could problems arise if a commit turns a locked file into a directory of the 
same name?

Max.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Proposal: Improved locking implementation for fsfs

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.
On Tue, 21 Dec 2004, Helge Jensen wrote:

> kfogel@collab.net wrote:
>
> > Specifically, for certain common non-ASCII characters
> > (e.g., most of those representable in ISO-8859-*), we could use an
> > encoding that reads more intuitively to a human:
> >
> >    foo/bör/baz.c ==> foo/bor/baz.c
> >
> > or
> >
> >    foo/bör/baz.c ==> foo/boer/baz.c
> >
> > This would make the lock tree more debuggable in common cases.
>
> Here, both "foo/bör" and "foo/boer" would translate to the same lock-path.
>
> Basically transformed self-representation needs some escape mechanism no
> matter how you twist and turn it.
>
> This is a built-in problem of variation A.
>
>
In this whole proposal, the filename used in the locking directory is
treated like a hash code in a hash table, so this is just a hash
collision. Then we just store multiple lock entries in the same file. This
isn't specific to variation A, since using just a part of the MD5 hash
increases the risk for collisions.

Regards,
//Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org


Re: Proposal: Improved locking implementation for fsfs

Posted by Helge Jensen <he...@slog.dk>.
kfogel@collab.net wrote:

When translating self-representations you need to ensure injectivity.

> Specifically, for certain common non-ASCII characters
> (e.g., most of those representable in ISO-8859-*), we could use an
> encoding that reads more intuitively to a human:
> 
>    foo/bör/baz.c ==> foo/bor/baz.c
> 
> or
> 
>    foo/bör/baz.c ==> foo/boer/baz.c
> 
> This would make the lock tree more debuggable in common cases.

Here, both "foo/bör" and "foo/boer" would translate to the same lock-path.

Basically transformed self-representation needs some escape mechanism no 
matter how you twist and turn it.

This is a built-in problem of variation A.

-- 
Helge

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Proposal: Improved locking implementation for fsfs

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.
On Tue, 20 Dec 2004 kfogel@collab.net wrote:

> "Brian W. Fitzpatrick" <fi...@collab.net> writes:
> > Variation A: The Hash/ASCII Experiment
> >
> > Summary: Force each path component to some 127bit ASCII representation
> > and store locks the filesystem under a lowercase "ASCII-ized" name.
> > Use a hash for names that can't be efficiently "ASCII-ized".
> >
> > Details:
> >
> > The path to a lock is broken into its components, and each path
> > component is translated into its lowercase ASCII equivalent (Using the
> > 4 character hex code) for non-ASCII characters.  If the resulting
> > "ASCII-ized" string is both a)longer than 16 characters and b) greater
> > than 1.5x times the length of the original string, we fall back onto
> > the 16 byte MD5 hash as described in the main proposal.
>
> I like Variation A, despite its slightly greater complexity.
>
I don't think it is essential, since you can use grep for debugging, but
I'm not against it either. Just be careful with device names on Windows.
(See r9895).

 > Note that since we have a collision-resolution mechanism anyway, we
> have some flexibility in our hash function.  We needn't encode all
> non-ASCII as 4 character hex codes; we could encode them in some more
> friendly way.  Specifically, for certain common non-ASCII characters
> (e.g., most of those representable in ISO-8859-*), we could use an
> encoding that reads more intuitively to a human:
>
Yes, we can, but I see even less value in this. You know, 8859-1 is common
in one part of the world... ASCII is ASCII, but going further becomes
quite arbitrary IMO

Regards,
//Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Proposal: Improved locking implementation for fsfs

Posted by kf...@collab.net.
"Brian W. Fitzpatrick" <fi...@collab.net> writes:
> Variation A: The Hash/ASCII Experiment
> 
> Summary: Force each path component to some 127bit ASCII representation
> and store locks the filesystem under a lowercase "ASCII-ized" name.
> Use a hash for names that can't be efficiently "ASCII-ized".
> 
> Details:
> 
> The path to a lock is broken into its components, and each path
> component is translated into its lowercase ASCII equivalent (Using the
> 4 character hex code) for non-ASCII characters.  If the resulting
> "ASCII-ized" string is both a)longer than 16 characters and b) greater
> than 1.5x times the length of the original string, we fall back onto
> the 16 byte MD5 hash as described in the main proposal.

I like Variation A, despite its slightly greater complexity.

Note that since we have a collision-resolution mechanism anyway, we
have some flexibility in our hash function.  We needn't encode all
non-ASCII as 4 character hex codes; we could encode them in some more
friendly way.  Specifically, for certain common non-ASCII characters
(e.g., most of those representable in ISO-8859-*), we could use an
encoding that reads more intuitively to a human:

   foo/bör/baz.c ==> foo/bor/baz.c

or

   foo/bör/baz.c ==> foo/boer/baz.c

This would make the lock tree more debuggable in common cases.

-Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org


Re: Proposal: Improved locking implementation for fsfs

Posted by Branko Čibej <br...@xbc.nu>.
Peter N. Lundblad wrote:

>On Wed, 22 Dec 2004, [UTF-8] Branko �^Libej wrote:
>  
>
>>So I'd suggest using a scheme that scales slightly better, unless that's
>>really a huge pain to do.
>>
>Maybe. Any suggestions? Rehashing? Not very funny in the file system. One
>more character? Gives 4096 directories and about 10 million files with
>reasonalbe directory size. Or, anything else?
>  
>
More than one level of hierarchy.

-- Brane


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Proposal: Improved locking implementation for fsfs

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.
On Wed, 22 Dec 2004, [UTF-8] Branko �^Libej wrote:

> Peter N. Lundblad wrote:
>
> >On Mon, 20 Dec 2004, Brian W. Fitzpatrick wrote:
> >
> >
> >
> >>This proposal sticks with the current mechanism of using the
> >>filesystem as a tree to store the locks themselves, and the
> >>lock-tokens (which merely contain a pointer to the lock) are all
> >>stored in one directory, keyed on their token/uuid.
> >>
> >>
> >>
> >Just for the record, we discussed the lock-tokens storage further on IRC
> >and agreed that we should use a one-level hierarchy, using the first few
> >bytes of the lock-token to spread these files in directories. This helps
> >filesystems with problems having many files in one directory. Using two
> >characters gives 256 directories in the worst case. That seems to be a
> >reasonable balane between many files in one directory and one directory
> >per lock token. We don't expect millions of locks, do we?
> >
> >
> Not "expect", no, but we've been burned before where we based out
> implementation on something we thought was a reasonable common case,
> then it turned out people were doing something quite unreasonable (such
> as, e.g., putting 50000 files in one directory).
>
> So I'd suggest using a scheme that scales slightly better, unless that's
> really a huge pain to do.
>
Maybe. Any suggestions? Rehashing? Not very funny in the file system. One
more character? Gives 4096 directories and about 10 million files with
reasonalbe directory size. Or, anything else?

Thanks,
//Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org


Re: Proposal: Improved locking implementation for fsfs

Posted by Branko Čibej <br...@xbc.nu>.
Peter N. Lundblad wrote:

>On Mon, 20 Dec 2004, Brian W. Fitzpatrick wrote:
>
>  
>
>>This proposal sticks with the current mechanism of using the
>>filesystem as a tree to store the locks themselves, and the
>>lock-tokens (which merely contain a pointer to the lock) are all
>>stored in one directory, keyed on their token/uuid.
>>
>>    
>>
>Just for the record, we discussed the lock-tokens storage further on IRC
>and agreed that we should use a one-level hierarchy, using the first few
>bytes of the lock-token to spread these files in directories. This helps
>filesystems with problems having many files in one directory. Using two
>characters gives 256 directories in the worst case. That seems to be a
>reasonable balane between many files in one directory and one directory
>per lock token. We don't expect millions of locks, do we?
>  
>
Not "expect", no, but we've been burned before where we based out 
implementation on something we thought was a reasonable common case, 
then it turned out people were doing something quite unreasonable (such 
as, e.g., putting 50000 files in one directory).

So I'd suggest using a scheme that scales slightly better, unless that's 
really a huge pain to do.

-- Brane


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Proposal: Improved locking implementation for fsfs

Posted by "Peter N. Lundblad" <pe...@famlundblad.se>.
On Mon, 20 Dec 2004, Brian W. Fitzpatrick wrote:

>
> This proposal sticks with the current mechanism of using the
> filesystem as a tree to store the locks themselves, and the
> lock-tokens (which merely contain a pointer to the lock) are all
> stored in one directory, keyed on their token/uuid.
>
Just for the record, we discussed the lock-tokens storage further on IRC
and agreed that we should use a one-level hierarchy, using the first few
bytes of the lock-token to spread these files in directories. This helps
filesystems with problems having many files in one directory. Using two
characters gives 256 directories in the worst case. That seems to be a
reasonable balane between many files in one directory and one directory
per lock token. We don't expect millions of locks, do we?

regards,
//Peter

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org