Posted to dev@subversion.apache.org by kf...@collab.net on 2005/04/26 22:51:57 UTC

Re: Using md5sum for svn status

Marcus Rueckert <da...@web.de> writes:
> According to Ben (sussman) the current change detection does the
> following steps:
> 
> 1. load entries file into memory
> 2. stats the file
> 3. if the timestamps match -> returns NOT_CHANGED
> 4. if the timestamps differ it stats the text base.
> 5. if the size of text base and file differ -> returns CHANGED
> 6. if the sizes match it does a byte-by-byte comparison.
> 
> I think step 6 can be optimized a bit.
> The entries file has the md5sum of the text-base stored.
> Why don't we just read the working file and md5sum the content?
> This way we only need to read 1 file into memory (the working file) and
> the md5sum algorithm might be faster than the diff algorithm.
> 
> any comments?

To calculate an MD5 sum, you must read every byte in the file.

To discover that there is some difference between two files, you must
read, on average, halfway through both files -- once you encounter a
mismatch, you can stop.  (This is why Unix 'diff' and 'cmp' are not the
same thing.)

So I don't see that there's a big win here...
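
As a rough illustration of that trade-off, here is a minimal Python
sketch. It is not Subversion's actual code: the function names, the
chunk size, and the recorded_mtime parameter are all hypothetical.
files_differ() stops at the first mismatching chunk, while md5_of()
always reads the whole file. Note that when the two files turn out to
be identical, the comparison also has to read both of them to the end.

```python
import hashlib
import os

CHUNK = 64 * 1024  # arbitrary read size, chosen for illustration


def files_differ(path_a, path_b, chunk_size=CHUNK):
    """Return True as soon as a differing chunk is found (early exit)."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return True   # mismatch: no need to read any further
            if not a:
                return False  # both files hit EOF without a mismatch


def md5_of(path, chunk_size=CHUNK):
    """Hash a file; every byte has to be read before the digest is known."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def is_modified(working, text_base, recorded_mtime):
    """Steps 2-6 of the quoted list, with hypothetical names."""
    st = os.stat(working)
    if st.st_mtime == recorded_mtime:
        return False                      # step 3: timestamps match
    if st.st_size != os.stat(text_base).st_size:
        return True                       # step 5: sizes differ
    return files_differ(working, text_base)  # step 6: compare bytes
```

The md5 variant of step 6 would replace the last line with a comparison
of md5_of(working) against the checksum recorded in the entries file.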

Best,
-Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: Using md5sum for svn status

Posted by Daniel Berlin <db...@dberlin.org>.
On Tue, 2005-04-26 at 17:51 -0500, kfogel@collab.net wrote:
> Marcus Rueckert <da...@web.de> writes:
> > According to Ben (sussman) the current change detection does the
> > following steps:
> > 
> > 1. load entries file into memory
> > 2. stats the file
> > 3. if the timestamps match -> returns NOT_CHANGED
> > 4. if the timestamps differ it stats the text base.
> > 5. if the size of text base and file differ -> returns CHANGED
> > 6. if the sizes match it does a byte-by-byte comparison.
> > 
> > I think step 6 can be optimized a bit.
> > The entries file has the md5sum of the text-base stored.
> > Why don't we just read the working file and md5sum the content?
> > This way we only need to read 1 file into memory (the working file) and
> > the md5sum algorithm might be faster than the diff algorithm.
> > 
> > any comments?
> 
> To calculate an MD5 sum, you must read every byte in the file.

Actually, you could do the md5sum in rolling fashion and stop when the
two don't match.
:)
But this is no faster than comparing bytes (in part because MD5 is not
magic).
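
A minimal Python sketch of that rolling idea, under the assumption that
both files are hashed side by side and the running digests are compared
after every chunk (the function name and chunk size are made up for
illustration). It still reads both files, and it does strictly more
work per chunk than comparing the bytes directly. A stored checksum of
the complete text-base would not help here, because it cannot be
compared against the digest of a prefix.

```python
import hashlib


def first_divergent_chunk(path_a, path_b, chunk_size=64 * 1024):
    """Hash two files in lockstep; return the index of the first chunk
    after which the running MD5 digests differ, or None if they never do."""
    ha, hb = hashlib.md5(), hashlib.md5()
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        index = 0
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                return None  # both files exhausted, digests always agreed
            ha.update(a)
            hb.update(b)
            # digest() reports the hash of the data fed so far and leaves
            # the hash object usable for further update() calls.
            if ha.digest() != hb.digest():
                return index
            index += 1
```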




Re: Using md5sum for svn status

Posted by Philip Martin <ph...@codematters.co.uk>.
kfogel@collab.net writes:

> Marcus Rueckert <da...@web.de> writes:
>> According to Ben (sussman) the current change detection does the
>> following steps:
>> 
>> 1. load entries file into memory
>> 2. stats the file
>> 3. if the timestamps match -> returns NOT_CHANGED
>> 4. if the timestamps differ it stats the text base.
>> 5. if the size of text base and file differ -> returns CHANGED
>> 6. if the sizes match it does a byte-by-byte comparison.
>> 
>> I think step 6 can be optimized a bit.
>> The entries file has the md5sum of the text-base stored.
>> Why don't we just read the working file and md5sum the content?
>> This way we only need to read 1 file into memory (the working file) and
>> the md5sum algorithm might be faster than the diff algorithm.
>> 
>> any comments?
>
> To calculate an MD5 sum, you must read every byte in the file.
>
> To discover that there is some difference between two files, you must
> read, on average, halfway through both files -- once you encounter a
> mismatch, you can stop.  (This is why Unix 'diff' and 'cmp' are not the
> same thing.)
>
> So I don't see that there's a big win here...

That ignores keyword expansion and eol conversion.  When svn:keywords
or svn:eol-style is in use we "detranslate" the working file before
doing a byte-for-byte comparison between the detranslated file and the
text-base.  If we stored the md5sum of the translated file we could
avoid the detranslation; that could well be a win.
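
To make the two paths concrete, here is a small Python sketch. It is a
simplification under stated assumptions, not the real working-copy
code: only CRLF/LF conversion is modelled (keyword expansion is
ignored), files are passed as in-memory bytes, and the function names
and the translated_md5 value are hypothetical.

```python
import hashlib


def detranslate_eol(data: bytes) -> bytes:
    """Stand-in for detranslation: turn CRLF line endings back into LF."""
    return data.replace(b"\r\n", b"\n")


def modified_via_detranslation(working: bytes, text_base: bytes) -> bool:
    """Current approach: detranslate first, then compare with the text-base."""
    return detranslate_eol(working) != text_base


def modified_via_translated_md5(working: bytes, translated_md5: str) -> bool:
    """Proposed approach: hash the working file as-is and compare it with a
    checksum recorded for the translated form (assumed to be stored)."""
    return hashlib.md5(working).hexdigest() != translated_md5
```

In the second function the working file is read once and never
rewritten, which is where the possible saving would come from.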

-- 
Philip Martin


Re: Using md5sum for svn status

Posted by Ben Collins-Sussman <su...@collab.net>.
On Apr 27, 2005, at 7:35 AM, Dirk Schenkewitz wrote:

> When I read the following, I had an idea:
>
> Ben Collins-Sussman wrote:
>> On Apr 26, 2005, at 11:57 PM, Peter McNab wrote:
>>>> I was looking at how TortoiseSVN stores a base working copy of 
>>>> files with the WC, which is great for instant diff, but is not so 
>>>> meaningful for binary files.
>> This isn't TortoiseSVN; this is how the svn client libraries work, and 
>> it's not something you can turn off.  It's not just for running 'svn 
>> diff' without hitting the network, it also means you can 'svn revert' 
>> without hitting the network, and it means 'svn commit' can send 
>> binary diffs to the server, instead of the entire file.
>>>> We are hearing from folks on the list who have 500Mb binaries so 
>>>> unnecessary download and local duplication of these might be a 
>>>> productive goal for Subversion.
>>>>
>> Yes, someday (i.e. svn 2.0) the cache-in-working-copy will be 
>> optional.
>
> How about storing the "secret copy" as compressed files, as an option?
> This would increase the operating times very much, but it would save
> some disk space.
>

Yep, that's been discussed too.  The current thinking is that the 
text-bases will be a choice of {yes | no | compressed}.
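
A rough Python sketch of what the "compressed" choice could look like.
The thread does not name a compression scheme, so zlib is an
assumption here, as are the function names. It shows the trade-off
mentioned above: the stored copy gets smaller, but every comparison
pays for a decompression.

```python
import zlib


def store_text_base(data: bytes) -> bytes:
    """Keep the pristine copy in compressed form to save disk space."""
    return zlib.compress(data)


def matches_text_base(working: bytes, stored: bytes) -> bool:
    """Comparing against the pristine copy now costs a decompression."""
    return zlib.decompress(stored) == working
```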


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@subversion.tigris.org
For additional commands, e-mail: users-help@subversion.tigris.org

Re: Using md5sum for svn status

Posted by Dirk Schenkewitz <sc...@docomolab-euro.com>.
When I read the following, I had an idea:

Ben Collins-Sussman wrote:
> 
> On Apr 26, 2005, at 11:57 PM, Peter McNab wrote:
> 
>>> I was looking at how TortoiseSVN stores a base working copy of files 
>>> with the WC, which is great for instant diff, but is not so 
>>> meaningful for binary files.
> 
> 
> This isn't TortoiseSVN; this is how the svn client libraries work, and it's 
> not something you can turn off.  It's not just for running 'svn diff' 
> without hitting the network, it also means you can 'svn revert' without 
> hitting the network, and it means 'svn commit' can send binary diffs to 
> the server, instead of the entire file.
> 
>>> We are hearing from folks on the list who have 500Mb binaries so 
>>> unnecessary download and local duplication of these might be a 
>>> productive goal for Subversion.
>>>
> 
> Yes, someday (i.e. svn 2.0) the cache-in-working-copy will be optional.

How about storing the "secret copy" as compressed files, as an option?
This would increase the operating times very much, but it would save
some disk space.

Best regards
   Dirk


Re: Using md5sum for svn status

Posted by Ben Collins-Sussman <su...@collab.net>.
On Apr 26, 2005, at 11:57 PM, Peter McNab wrote:

>> I was looking at how TortoiseSVN stores a base working copy of files 
>> with the WC, which is great for instant diff, but is not so 
>> meaningful for binary files.

This isn't TortoiseSVN; this is how the svn client libraries work, and it's 
not something you can turn off.  It's not just for running 'svn diff' 
without hitting the network, it also means you can 'svn revert' without 
hitting the network, and it means 'svn commit' can send binary diffs to 
the server, instead of the entire file.

>> We are hearing from folks on the list who have 500Mb binaries so 
>> unnecessary download and local duplication of these might be a 
>> productive goal for Subversion.
>>

Yes, someday (i.e. svn 2.0) the cache-in-working-copy will be optional.



Re: Using md5sum for svn status

Posted by Peter McNab <mc...@melbpc.org.au>.
Peter McNab wrote:

> kfogel@collab.net wrote:
>
>> Marcus Rueckert <da...@web.de> writes:
>>  
>>
>>> According to Ben (sussman) the current change detection does the
>>> following steps:
>>>
>>> 1. load entries file into memory
>>> 2. stats the file
>>> 3. if the timestamps match -> returns NOT_CHANGED
>>> 4. if the timestamps differ it stats the text base.
>>> 5. if the size of text base and file differ -> returns CHANGED
>>> 6. if the sizes match it does a byte-by-byte comparison.
>>>
>>> I think step 6 can be optimized a bit.
>>> The entries file has the md5sum of the text-base stored.
>>> Why don't we just read the working file and md5sum the content?
>>> This way we only need to read 1 file into memory (the working file) and
>>> the md5sum algorithm might be faster than the diff algorithm.
>>>
>>> any comments?
>>>   
>>
>>
>> To calculate an MD5 sum, you must read every byte in the file.
>>
>> To discover that there is some difference between two files, you must
>> read, on average, halfway through both files -- once you encounter a
>> mismatch, you can stop.  (This is why Unix 'diff' and 'cmp' are not the
>> same thing.)
>>
>> So I don't see that there's a big win here...
>>
>> Best,
>> -Karl
>>
>>  
>>
> I'm a little surprised this "summary" info (1..5) isn't made available 
> by enquiry of the server and if step 6 is required then and only then 
> bring down the file.
> I was looking at how TortoiseSVN stores a base working copy of files 
> with the WC, which is great for instant diff, but is not so 
> meaningful for binary files.
> Somehow I think an option to get and hold the summary info locally and 
> only download the full file when absolutely necessary might be a good 
> thing.
> We are hearing from folks on the list who have 500Mb binaries so 
> unnecessary download and local duplication of these might be a 
> productive goal for Subversion.
>
> Peter
>
>

