Posted to dev@subversion.apache.org by Greg Hudson <gh...@MIT.EDU> on 2002/10/09 17:10:13 UTC

Blue-sky idea: Representation reuse

Here's an idea for how to reap space savings for some kinds of use
cases.  Basically, a version control system tries to save repository
space in two ways:

  1. If two files are identical, only store one copy.
  2. If two files are similar, store one as a diff against the other.

Like most revision control systems, we rely on a file's history to
find these advantages.  We only look for similarity or identity
between a file and its most recent revision (or its copy history).
So, for example, if you do:

  svn import http://a/b linux-1.0 linux-1.0
  svn import http://a/b linux-1.1 linux-1.1
  ...

you won't get any space savings except from self-compression.
(Instead, you have to do all your imports to the same place and
interleave those with copies.)  Similarly, right now if you branch a
file, make extensive modifications to the file on one branch, and then
merge those mods onto the other branch, the repository records those
diffs twice instead of reusing the representation.

But it doesn't have to be that way.  It's hard to look for similarity
with random other files in the repository, but it's easy to look for
identity.  We can maintain a mapping from checksum to representation.
When you store a representation, if you find a match in the checksum
table, you verify that it's the same contents and use that rep
instead.
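
[Editorial sketch] The checksum-table lookup described above might look roughly like this (an illustrative Python sketch, not the actual BDB filesystem code; md5 stands in for whatever checksum the table would use, and all names are hypothetical):

```python
import hashlib

class RepStore:
    """Toy content-addressed store: one representation per distinct contents."""

    def __init__(self):
        self.reps = {}          # rep_id -> contents
        self.by_checksum = {}   # checksum -> rep_id
        self.next_id = 0

    def store(self, contents):
        checksum = hashlib.md5(contents).hexdigest()
        existing = self.by_checksum.get(checksum)
        # On a checksum hit, verify the full contents before reusing the
        # rep, so a hash collision can never silently corrupt data.
        if existing is not None and self.reps[existing] == contents:
            return existing
        rep_id = self.next_id
        self.next_id += 1
        self.reps[rep_id] = contents
        self.by_checksum[checksum] = rep_id
        return rep_id

store = RepStore()
first = store.store(b"GNU GENERAL PUBLIC LICENSE\n")
second = store.store(b"GNU GENERAL PUBLIC LICENSE\n")  # e.g. a second import
assert first == second          # only one copy is stored
assert len(store.reps) == 1
```

The verification step is what makes the scheme safe: the checksum table is only a hint, and the byte-for-byte comparison is the actual identity test.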

There are a few complications:

  * Right now, when we abort a transaction, we remove all the reps it
    refers to in mutable nodes.  We'll need to keep track of when we
    reused a representation to avoid destroying existing data.

  * We'll have to think about how redeltification interacts with
    representation reuse.

This idea shouldn't be too hard to implement, but it's not clear how
much it would help the average user.  So it's up in the air whether
it's worth the complexity.

An even more blue-sky idea, which Mike came up with, is to have a
background daemon running around looking for unexploited rep
similarities and redeltifying things.  I think this is too hard
(there's no good heuristic for finding similar files except by doing
O(n^2) comparisons... though you might be able to play games with file
sizes or statistical characteristics of the contents, I suppose), but
it's an option.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Blue-sky idea: Representation reuse

Posted by Branko Čibej <br...@xbc.nu>.
Greg Hudson wrote:

>Here's an idea for how to reap space savings for some kinds of use
>cases.  Basically, a version control system tries to save repository
>space in two ways:
>
>  1. If two files are identical, only store one copy.
>  2. If two files are similar, store one as a diff against the other.
>
>Like most revision control systems, we rely on a file's history to
>find these advantages.  We only look for similarity or identity
>between a file and its most recent revision (or its copy history).
>So, for example, if you do:
>
>  svn import http://a/b linux-1.0 linux-1.0
>  svn import http://a/b linux-1.1 linux-1.1
>  ...
>
>you won't get any space savings except from self-compression.
>(Instead, you have to do all your imports to the same place and
>interleave those with copies.)  Similarly, right now if you branch a
>file, make extensive modifications to the file on one branch, and then
>merge those mods onto the other branch, the repository records those
>diffs twice instead of reusing the representation.
>
>But it doesn't have to be that way.  It's hard to look for similarity
>with random other files in the repository, but it's easy to look for
>identity.  We can maintain a mapping from checksum to representation.
>When you store a representation, if you find a match in the checksum
>table, you verify that it's the same contents and use that rep
>instead.
>
>There are a few complications:
>
>  * Right now, when we abort a transaction, we remove all the reps it
>    refers to in mutable nodes.  We'll need to keep track of when we
>    reused a representation to avoid destroying existing data.
>
That's a "simple" matter of adding a reference count to the rep.
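
[Editorial sketch] That reference-counting scheme can be sketched as follows (hypothetical Python, not Subversion's actual schema): a rep is deleted only when its last referrer releases it, so aborting a transaction that reused an existing rep cannot destroy committed data.

```python
class RefCountedReps:
    """Sketch: reps survive transaction aborts while anything still refers
    to them; only the last release actually deletes the data."""

    def __init__(self):
        self.contents = {}   # rep_id -> bytes
        self.refcount = {}   # rep_id -> int

    def add_ref(self, rep_id):
        self.refcount[rep_id] = self.refcount.get(rep_id, 0) + 1

    def release(self, rep_id):
        self.refcount[rep_id] -= 1
        if self.refcount[rep_id] == 0:
            # Last referrer is gone: now it is safe to delete the rep.
            del self.contents[rep_id]
            del self.refcount[rep_id]

reps = RefCountedReps()
reps.contents[1] = b"file contents"
reps.add_ref(1)      # referenced by a committed node-revision
reps.add_ref(1)      # a new transaction reuses the same rep
reps.release(1)      # that transaction is aborted
assert 1 in reps.contents    # the committed data survives the abort
```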

>  * We'll have to think about how redeltification interacts with
>    representation reuse.
>
Hah, yes, redeltifying along different history paths that just happen to 
use the same representation could be fun. :-)

>This idea shouldn't be too hard to implement, but it's not clear how
>much it would help the average user.  So it's up in the air whether
>it's worth the complexity.
>
Some repositories would see a huge benefit; others wouldn't see any. Maybe 
it would be best to make this optional, to be turned on when you create 
a repository: "svnadmin create --squashed repo".

>An even more blue-sky idea, which Mike came up with, is to have a
>background daemon running around looking for unexploited rep
>similarities and redeltifying things.  I think this is too hard
>(there's no good heuristic for finding similar files except by doing
>O(n^2) comparisons... though you might be able to play games with file
>sizes or statistical characteristics of the contents, I suppose), but
>it's an option.
>
This would probably be more useful as an offline utility: "svnadmin 
compress"?


-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/



Re: Blue-sky idea: Representation reuse

Posted by Jim Blandy <ji...@red-bean.com>.
mark benedetto king <bk...@Inquira.Com> writes:
> On Wed, Oct 09, 2002 at 10:22:57PM -0500, Jim Blandy wrote:
> > You should look at the papers surrounding rsync.  They had stuff in
> > there for detecting plagiarism in a corpus of student work; that
> > definitely involved recognizing partial matches in a large corpus of
> > files.  This has definitely been done.
> 
> Yes, it has been done, but AFAIK, there are the "sound" O(n^2) approaches,
> and the "practical" O(n log n) approaches.

Sorry, I don't understand the distinction.


Re: Blue-sky idea: Representation reuse

Posted by mark benedetto king <bk...@inquira.com>.
On Sat, Oct 12, 2002 at 01:10:50PM -0400, Greg Hudson wrote:
> On Sat, 2002-10-12 at 09:11, mark benedetto king wrote:
> > Yes, it has been done, but AFAIK, there are the "sound" O(n^2) approaches,
> > and the "practical" O(n log n) approaches.
> 
> I'm still not sure how you get n log n.  Even if you can reduce the file
> contents to a hash with nice similarity properties, how do you find
> pairs of hashes which are similar?  Isn't that just O(n^2) with a
> smaller constant?

Hashing to a "document signature" and then pairwise-comparing the
hashes to find matches is, as you suggest, O(n^2) to cluster N documents.

You're having trouble thinking of the other ways because they
don't actually work. :-)

The N*log(N) ways typically require assigning a hash to each cluster.
Then you don't need to compare document-hashes, only document-hashes
to cluster-hashes (if the documents don't cluster, you get the O(N^2)
behaviour).  IMO, these approaches are fundamentally broken.  But
at least they terminate before the user does.
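
[Editorial sketch] A toy version of the cluster-hash family described above (all names hypothetical): each document's signature is compared only against one representative signature per existing cluster, so well-clustered inputs need far fewer than O(N^2) comparisons.

```python
def hamming(a, b):
    """Distance between two integer signatures: number of differing bits."""
    return bin(a ^ b).count("1")

def cluster(docs, signature, threshold):
    """Greedy one-pass clustering: compare each document's signature only
    against one representative signature per existing cluster."""
    clusters = []  # list of (representative_signature, member_list)
    for doc in docs:
        sig = signature(doc)
        for rep_sig, members in clusters:
            if hamming(sig, rep_sig) <= threshold:
                members.append(doc)
                break
        else:
            clusters.append((sig, [doc]))
    return clusters

# With identity "signatures" on small ints: two tight clusters emerge.
groups = cluster([0b0000, 0b0001, 0b111100001111, 0b111100001110],
                 signature=lambda d: d, threshold=2)
assert len(groups) == 2
```

When nothing clusters, every document starts its own cluster and the loop degenerates to pairwise comparison, which is exactly the O(N^2) fallback mentioned above.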

--ben



Re: Blue-sky idea: Representation reuse

Posted by Greg Hudson <gh...@MIT.EDU>.
On Sat, 2002-10-12 at 09:11, mark benedetto king wrote:
> Yes, it has been done, but AFAIK, there are the "sound" O(n^2) approaches,
> and the "practical" O(n log n) approaches.

I'm still not sure how you get n log n.  Even if you can reduce the file
contents to a hash with nice similarity properties, how do you find
pairs of hashes which are similar?  Isn't that just O(n^2) with a
smaller constant?

> What if the repository were stored on a compressed filesystem?
> This could (IMO) give us similar total storage savings as
> a shared-representation model, without any additional code or
> testing.

A compressing filesystem would only exploit local similarities; if you
check in the same file ten times, you'll still get ten compressed copies
of the file.  It could only produce similar total storage savings by
numerical coincidence.

(Self-compressing plaintexts is probably a lot faster than storing our
whole database on a compressing filesystem.  Plus, as far as I can tell,
compressing filesystems just aren't that prevalent outside of Windows.)



Re: Blue-sky idea: Representation reuse

Posted by mark benedetto king <bk...@inquira.com>.
On Wed, Oct 09, 2002 at 10:22:57PM -0500, Jim Blandy wrote:
> 
> You should look at the papers surrounding rsync.  They had stuff in
> there for detecting plagiarism in a corpus of student work; that
> definitely involved recognizing partial matches in a large corpus of
> files.  This has definitely been done.
> 

Yes, it has been done, but AFAIK, there are the "sound" O(n^2) approaches,
and the "practical" O(n log n) approaches.

What if the repository were stored on a compressed filesystem?
This could (IMO) give us similar total storage savings as
a shared-representation model, without any additional code or
testing.

--ben



Re: Blue-sky idea: Representation reuse

Posted by Jim Blandy <ji...@red-bean.com>.
You should look at the papers surrounding rsync.  They had stuff in
there for detecting plagiarism in a corpus of student work; that
definitely involved recognizing partial matches in a large corpus of
files.  This has definitely been done.



Re: Blue-sky idea: Representation reuse

Posted by Branko Čibej <br...@xbc.nu>.
Rick Bradley wrote:

>* Branko Čibej (brane@xbc.nu) [021009 16:32]:
>  
>
>>>Given a set of n files (strings) F = { f_0, ... f_n }, for each file f_i
>>>in F compute the signature s_i = S(f_i) and store it in a hash along
>>>with i.  Whenever an insert into the hash finds a prior string f_j
>>>there, then merge f_i and f_j in subversion.
>>>
>>>      
>>>
>>No, he was talking about finding _similar_ files. Sure, you can use a 
>>digest index to find _identical_ ones, but for similarity, only O(N^2) 
>>will do.
>>    
>>
>
>Consider what happens when your digest can be used as a distance metric:
>
>http://ixazon.dynip.com/~cmeclax/nilsimsa.html
>
>(or similar)
>
>Depending upon the degree of similarity required/desired one may still
>have to sort digests or use a priority queue to streamline comparisons,
>but that's still going to be better than O(n^2).  I don't know of any
>quadratic lower bounds for similarity comparisons.
>  
>

Hm, good point. Usually, when I think of digests, I assume cryptographic 
hashes, which you certainly can't use as a distance metric (at least I 
hope so; if you can, then I'm not using ssh any more :-).

Hm. Using a nilsimsa hash for a key would let you find the most similar 
representation in O(log N) time, then? Interesting indeed.

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/



Re: Blue-sky idea: Representation reuse

Posted by Rick Bradley <sv...@rickbradley.com>.
* Branko Čibej (brane@xbc.nu) [021009 16:32]:
> >Given a set of n files (strings) F = { f_0, ... f_n }, for each file f_i
> >in F compute the signature s_i = S(f_i) and store it in a hash along
> >with i.  Whenever an insert into the hash finds a prior string f_j
> >there, then merge f_i and f_j in subversion.
> >
> No, he was talking about finding _similar_ files. Sure, you can use a 
> digest index to find _identical_ ones, but for similarity, only O(N^2) 
> will do.

Consider what happens when your digest can be used as a distance metric:

http://ixazon.dynip.com/~cmeclax/nilsimsa.html

(or similar)

Depending upon the degree of similarity required/desired one may still
have to sort digests or use a priority queue to streamline comparisons,
but that's still going to be better than O(n^2).  I don't know of any
quadratic lower bounds for similarity comparisons.
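
[Editorial sketch] One way to read the "sort digests" suggestion as code (a hedged sketch, assuming a locality-sensitive digest where small Hamming distance implies similar content; nilsimsa is one such digest, but the identity function is used in the demo): sort by digest, then compare each document only against a small window of neighbours in digest order.

```python
def hamming(a, b):
    """Number of differing bits between two integer digests."""
    return bin(a ^ b).count("1")

def similar_pairs(docs, digest, window=4, threshold=2):
    """Sort documents by digest, then compare each one only against its
    next `window` neighbours in digest order: an O(n log n) sort plus
    O(n * window) comparisons, instead of all O(n^2) pairs."""
    order = sorted(range(len(docs)), key=lambda i: digest(docs[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + 1 + window]:
            if hamming(digest(docs[i]), digest(docs[j])) <= threshold:
                pairs.append((i, j))
    return pairs

# Identity digests on ints: 0/1 are close, 1000/1001 are close.
assert similar_pairs([0, 1, 1000, 1001], digest=lambda d: d) == [(0, 1), (2, 3)]
```

The catch, which the thread's sceptics raise, is that this only works if numerically close digests really do imply similar contents; a cryptographic hash has no such property.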

Rick
-- 
 http://www.rickbradley.com    MUPRN: 465    (66F/66F)
                       |  of the que. Hehheh.
   random email haiku  |  Rimboy needs a bachelor
                       |  flag on rimboy.com.


Re: Blue-sky idea: Representation reuse

Posted by Branko Čibej <br...@xbc.nu>.
Rick Bradley wrote:

>* Greg Hudson (ghudson@MIT.EDU) [021009 12:12]:
>  
>
>>An even more blue-sky idea, which Mike came up with, is to have a
>>background daemon running around looking for unexploited rep
>>similarities and redeltifying things.  I think this is too hard
>>(there's no good heuristic for finding similar files except by doing
>>O(n^2) comparisons... though you might be able to play games with file
>>sizes or statistical characteristics of the contents, I suppose), but
>>it's an option.
>>    
>>
>
>There's a much better way than O(n^2) as long as you have a signature
>function S() that suits your needs (md5, nilsimsa, soundex :-), etc.).  
>
>Given a set of n files (strings) F = { f_0, ... f_n }, for each file f_i
>in F compute the signature s_i = S(f_i) and store it in a hash along
>with i.  Whenever an insert into the hash finds a prior string f_j
>there, then merge f_i and f_j in subversion.
>
No, he was talking about finding _similar_ files. Sure, you can use a 
digest index to find _identical_ ones, but for similarity, only O(N^2) 
will do.

>This has naive run-time O(n * (T(hash) + T(S))), excluding the
>complexity of the actual subversion merge, where T(hash) is the
>amortized run-time for a hash lookup/insert operation, and T(S) is the
>time to compute an S(f) signature.  Given reasonable implementations for
>the underlying hash data structure, and a sensible hashing algorithm
>this looks a lot like O(n) time.  Note that I'm obscuring a hidden
>union-find problem in tracking repeated merges but this will not
>significantly affect the run-time, and it's not clear to me at the
>moment that union-find is necessary for repeated merging in subversion
>if repository images are processed in temporal order.
>  
>


-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/



Re: Blue-sky idea: Representation reuse

Posted by Rick Bradley <sv...@rickbradley.com>.
* Greg Hudson (ghudson@MIT.EDU) [021009 12:12]:
> An even more blue-sky idea, which Mike came up with, is to have a
> background daemon running around looking for unexploited rep
> similarities and redeltifying things.  I think this is too hard
> (there's no good heuristic for finding similar files except by doing
> O(n^2) comparisons... though you might be able to play games with file
> sizes or statistical characteristics of the contents, I suppose), but
> it's an option.

There's a much better way than O(n^2) as long as you have a signature
function S() that suits your needs (md5, nilsimsa, soundex :-), etc.).  

Given a set of n files (strings) F = { f_0, ... f_n }, for each file f_i
in F compute the signature s_i = S(f_i) and store it in a hash along
with i.  Whenever an insert into the hash finds a prior string f_j
there, then merge f_i and f_j in subversion.

This has naive run-time O(n * (T(hash) + T(S))), excluding the
complexity of the actual subversion merge, where T(hash) is the
amortized run-time for a hash lookup/insert operation, and T(S) is the
time to compute an S(f) signature.  Given reasonable implementations for
the underlying hash data structure, and a sensible hashing algorithm
this looks a lot like O(n) time.  Note that I'm obscuring a hidden
union-find problem in tracking repeated merges but this will not
significantly affect the run-time, and it's not clear to me at the
moment that union-find is necessary for repeated merging in subversion
if repository images are processed in temporal order.
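
[Editorial sketch] The single-pass scheme above can be sketched as follows (illustrative Python; md5 stands in for the signature function S(), and the "merge" is only recorded, not performed):

```python
import hashlib

def find_merges(files):
    """One pass over n files: insert each signature into a hash table and
    record a merge whenever the signature was already present -- O(n)
    table operations plus n signature computations, not O(n^2) pairs."""
    seen = {}    # signature -> index of first file with that signature
    merges = []  # (duplicate_index, canonical_index)
    for i, contents in enumerate(files):
        sig = hashlib.md5(contents).hexdigest()
        if sig in seen:
            merges.append((i, seen[sig]))   # these two would share one rep
        else:
            seen[sig] = i
    return merges

assert find_merges([b"aa", b"bb", b"aa", b"cc"]) == [(2, 0)]
```

As Brane points out in his reply, with a cryptographic digest like md5 an exact table hit can only find identical files; a similarity signature would need a distance comparison rather than an exact-match lookup.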

Rick
-- 
 http://www.rickbradley.com    MUPRN: 862    (68F/68F)
                       |  some of the answers
   random email haiku  |  I received from the research
                       |  side of the business.


RE: Blue-sky idea: Representation reuse

Posted by Bill Tutt <ra...@lyra.org>.
Yeah, I've thought along similar lines before. Having representation
reuse for full texts alone would be a good thing. However, we don't have
a handy index against the representation checksums, let alone an easy
way to join from Representations to the set of NodeRevisions that use
the Representation. Lacking this easy join is one of the blocking issues
before even considering implementing an "svn obliterate", for example.

Certainly something to keep in the back of our minds when deciding what
to do for the post-1.0 schema.

Bill
----
Do you want a dangerous fugitive staying in your flat?
No.
Well, don't upset him and he'll be a nice fugitive staying in your flat.
 

> -----Original Message-----
> From: Greg Hudson [mailto:ghudson@MIT.EDU]
> Sent: Wednesday, October 09, 2002 10:10 AM
> To: dev@subversion.tigris.org
> Subject: Blue-sky idea: Representation reuse
> 
> Here's an idea for how to reap space savings for some kinds of use
> cases.  Basically, a version control system tries to save repository
> space in two ways:
> 
>   1. If two files are identical, only store one copy.
>   2. If two files are similar, store one as a diff against the other.
> 
> Like most revision control systems, we rely on a file's history to
> find these advantages.  We only look for similarity or identity
> between a file and its most recent revision (or its copy history).
> So, for example, if you do:
> 
>   svn import http://a/b linux-1.0 linux-1.0
>   svn import http://a/b linux-1.1 linux-1.1
>   ...
> 
> you won't get any space savings except from self-compression.
> (Instead, you have to do all your imports to the same place and
> interleave those with copies.)  Similarly, right now if you branch a
> file, make extensive modifications to the file on one branch, and then
> merge those mods onto the other branch, the repository records those
> diffs twice instead of reusing the representation.
> 
> But it doesn't have to be that way.  It's hard to look for similarity
> with random other files in the repository, but it's easy to look for
> identity.  We can maintain a mapping from checksum to representation.
> When you store a representation, if you find a match in the checksum
> table, you verify that it's the same contents and use that rep
> instead.
> 
> There are a few complications:
> 
>   * Right now, when we abort a transaction, we remove all the reps it
>     refers to in mutable nodes.  We'll need to keep track of when we
>     reused a representation to avoid destroying existing data.
> 
>   * We'll have to think about how redeltification interacts with
>     representation reuse.
> 
> This idea shouldn't be too hard to implement, but it's not clear how
> much it would help the average user.  So it's up in the air whether
> it's worth the complexity.
> 
> An even more blue-sky idea, which Mike came up with, is to have a
> background daemon running around looking for unexploited rep
> similarities and redeltifying things.  I think this is too hard
> (there's no good heuristic for finding similar files except by doing
> O(n^2) comparisons... though you might be able to play games with file
> sizes or statistical characteristics of the contents, I suppose), but
> it's an option.
> 


