You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@subversion.apache.org by Andy Whitcroft <ap...@shadowen.org> on 2004/04/17 12:00:49 UTC

File modification detection?

It appears that the client file modification detection is incorrectly checking modification time and assuming that if that is unchanged the file is unchanged even in the face of the size/inode number changing.

Reproduce recipe follows:
	
	svn --version
	rm -rf REPO repo
	svnadmin create REPO
	svn checkout file://localhost/`pwd`/REPO repo
	cd repo
	touch foo
	svn add foo
	svn commit -m add
	echo "one" >one; echo "two two" >two
	ls -li one two
	mv one foo
	svn commit -m one
	ls -li
	mv two foo
	svn commit -m two
	ls -li

Here is a trace of running this recipe (sh -x), note htat when two is moved over foo it is blatently a different sized file, but it is not checked in.

	apw@voidhawk:~/test$ sh -x test
	+ svn --version
	svn, version 1.0.1 (r9023)
	   compiled Apr 11 2004, 01:09:01
	
	Copyright (C) 2000-2004 CollabNet.
	Subversion is open source software, see http://subversion.tigris.org/
	This product includes software developed by CollabNet (http://www.Collab.Net/).
	
	The following repository access (RA) modules are available:
	
	* ra_dav : Module for accessing a repository via WebDAV (DeltaV) protocol.
	  - handles 'http' schema
	  - handles 'https' schema
	* ra_local : Module for accessing a repository on local disk.
	  - handles 'file' schema
	* ra_svn : Module for accessing a repository using the svn network protocol.
	  - handles 'svn' schema
	
	+ rm -rf REPO repo
	+ svnadmin create REPO
	++ pwd
	+ svn checkout file://localhost//home/apw/test/REPO repo
	Checked out revision 0.
	+ cd repo
	+ touch foo
	+ svn add foo
	A         foo
	+ svn commit -m add
	Adding         foo
	Transmitting file data .
	Committed revision 1.
	+ echo one
	+ echo 'two two'
	+ ls -li one two
	3875570 -rw-r--r--    1 apw      apw             4 Apr 17 12:58 one
	3875571 -rw-r--r--    1 apw      apw             8 Apr 17 12:58 two
	+ mv one foo
	+ svn commit -m one
	Sending        foo
	Transmitting file data .
	Committed revision 2.
	+ ls -li
	total 8
	3875570 -rw-r--r--    1 apw      apw             4 Apr 17 12:58 foo
	3875571 -rw-r--r--    1 apw      apw             8 Apr 17 12:58 two
	+ mv two foo
	+ svn commit -m two
	+ ls -li
	total 4
	3875571 -rw-r--r--    1 apw      apw             8 Apr 17 12:58 foo
	apw@voidhawk:~/test$

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: File modification detection?

Posted by Greg Hudson <gh...@MIT.EDU>.

On Sat, 2004-04-17 at 16:39, Ben Collins-Sussman wrote:
> I'm worried that the only way to prevent this is to always-stat the 
> text-base (which will slow down things even further, since we're doing 
> two file stats instead of one), or to record the text-base's size in our 
> entries file.

The latter seems like a better solution.

> Still, though, I could imagine somebody 'moving' a file with identical 
> timestamp *and* identical size on top of the working file, and still 
> fooling the algorithm.  So is this road even worth pursuing?

Well, on Unix, we could still fix that case by looking at the inode
number, as Andy suggests.  Not sure about Windows.

> Is there some nice philosophical quip that allows us to brush off this
> whole scenario?

I'm afraid not.  The best we can do is, "this situation seems pretty
unusual; if you hit it, we're sorry."


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: File modification detection?

Posted by Philip Martin <ph...@codematters.co.uk>.

Ben Collins-Sussman <su...@collab.net> writes:

> Andy Whitcroft wrote:
>
>> It appears that the client file modification detection is
>> incorrectly checking modification time and assuming that if that is
>> unchanged the file is unchanged even in the face of the size/inode
>> number changing.
>
> That is exactly what's happening.  SVN uses the same algorithm as CVS
> to detect if a file has changed:
>
>    if (file's timestamp == one recorded in .svn/entries)
>       return NOT_CHANGED
>    else if (size of file != size of .svn/text-base/file)
>       return CHANGED
>    else
>       compare file and .svn/text-base/file byte-by-byte.

That's not really correct, it would not handle keywords or eol-style.

> So what you're doing is 'mv'ing a different file with identical
> timestamp on top of the file, and the algorithm is falling down.
>
> Developers: is this a legitimate hole in our algorithm?  Should we
> change it to cover this edge-case?

It's certainly not a new problem, it's been seen lots of times on
Windows with files have the wrong eol-style in the repository.
Initially the files show up as unmodified (while the timestamps
match), once the timestamps change the eol difference means the files
show up as modified.

> I'm worried that the only way to prevent this is to always-stat the
> text-base (which will slow down things even further, since we're doing
> two file stats instead of one), or to record the text-base's size in
> our entries file.

Ha!  The current wc code makes very little attempt to minimise stat
calls (or any other system call).  Try it: when a file is modified the
current status code will stat the text-base several times, not to
mention stating the working file several times as well.  There is an
open issue about this.

In the past I have made some optimisations, but I don't think the
current code is ever going to be efficient as far as minimising system
calls is concerned.  Too many of the interfaces use char* paths which
leads to repeated calls to svn_io_check_path and svn_io_stat.

I would guess that some of the existing "optimisations" may turn out
to be pessimisations.  Take svn_wc_text_modified_p as an example: the
first thing it does is detect missing working files and that involves
a call to svn_io_check_path, i.e. a stat call.  If the file is present
and the code will go on to compare timestamps, which involves another
stat call.  Are missing files common?  Is the missing file code an
optimisation or a pessimisation?

> Still, though, I could imagine somebody 'moving' a file with identical
> timestamp *and* identical size on top of the working file, and still
> fooling the algorithm.  So is this road even worth pursuing?  Is there
> some nice philosophical quip that allows us to brush off this whole
> scenario?  Something like, "don't fool with timestamps, kid".

In the end a byte-by-byte comparison is the only way to guarantee that
changes are detected.  Making that the default would make Subversion
very slow.  In the past it has been suggested that this should be
under user control.  It has also been suggested that the user be
allowed to select md5sums instead of byte-by-byte comparison, as that
would involve reading the working file but not the text-base, and so
might be an optimisation when the working copy is on a network disk.

-- 
Philip Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: File modification detection?

Posted by Andy Whitcroft <ap...@shadowen.org>.

--On 23 April 2004 07:49 +0200 "Ph. Marek" <ph...@bmlv.gv.at> wrote:

> Currently svn tests the timestamp, and possibly the size.
> IMHO it would be better to test the size first - if that differs it is 
> guaranteed (modulo keyword expansion, which could be checked via 
> svn:keywords) that the files differ.
> 
> Alternatively, the md5sum of *parts* of the file could be stored (eg
> every 2MB  worth), which *could* speed up verifying.

As the original submitter of the issue.  I would be happy with it checking
the size and the date and assume if _both_ are the same the files are the
same.  I think that is a reasonable limitation.  What I was unhappy about
was it being blatant that the files were different.  Different _sizes_ and
it not notice.  I don't expect miracles, nor that the process be slowed 100
times for this corner case.

Cheers.

-apw

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: File modification detection?

Posted by "Ph. Marek" <ph...@bmlv.gv.at>.

> > > But then Karl pointed out that while "on average" our current algorithm
> > > bails after reading half the text-base, this is cancelled out by the
> > > fact that it's reading two files instead of one.  So maybe the
> > > byte-for-byte and checksum strategies come out even?  :-)
> >
> > It's important to remember svn:keywords and svn:eol-style when
> > discussing working files and text bases.  When we do a byte-for-byte
> > comparison with those keywords set then first the entire working file
> > is read, it gets "detranslated" and a new, temporary, file in
> > repository format is written.  It's that temporary file that takes
> > part in the byte-for-byte comparison, and it's quite possible that the
> > write is the main performance hit.
> >
> > A possible optimisation would be to store a second md5sum for the
> > working file, and then do an md5sum comparison by reading the working
> > file and thus avoid the detranslate/write altogether.
>
> Just a note: Even without a second md5sum, you can avoid at least the
> write. In theory, you can do the detranslation and md5sum in chunks in
> memory without ever writing anything to disk. Like, having the
> detranslation going into a (non-disk) stream and doing the md5sum over
> that stream and then letting that stream go to /dev/null or such. That
> was the theory. Don't know how difficult it is to implement that in
> Subversion with the existing infrastructure.
I'd like to register a wish, which is so small that it doesn't need to be in 
the issue tracker :-) ?

Currently svn tests the timestamp, and possibly the size.
IMHO it would be better to test the size first - if that differs it is 
guaranteed (modulo keyword expansion, which could be checked via 
svn:keywords) that the files differ.

Alternatively, the md5sum of *parts* of the file could be stored (eg every 2MB 
worth), which *could* speed up verifying.

> > A possible problem is that the user may have edited an expanded
> > keyword causing the md5sum comparison to indicate a modification,
> > whereas the detranslate would drop that edit and indicate no
> > modification.
>
> IMHO, that's a separate issue. The first is how to recognize changes
> reliably (and fast), the second is how to force submits even when
> there is no actual change (with regard of what would be commited).
>
> To me, changing a generated keyword part feels the same as changing
> the timestamp. It indicates that there was some kind of change (and
> that a check of the content may be due), but that may have been
> effectively an no-op.
>
> Whether we want to support a "no-change" commit and how it would be
> triggered (a flag, some keyword change as suggested above, etc.) is a
> different question and shouldn't influence how we want to detect real
> changes. (If we get it cheaply, fine, but it shouldn't limit us.)
How about reversing that?
Changing the value of an expanded keyword doesn't change the file in the 
repository, but leads to a commit?
No. I'd possibly like to commit a binary file, which doesn't have keywords ...


Regards,

Phil

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: File modification detection?

Posted by Benjamin Pflugmann <be...@pflugmann.de>.

On Thu 2004-04-22 at 21:12:28 +0100, Philip Martin wrote:
> Ben Collins-Sussman <su...@collab.net> writes:
[...]
> > But then Karl pointed out that while "on average" our current algorithm
> > bails after reading half the text-base, this is cancelled out by the
> > fact that it's reading two files instead of one.  So maybe the
> > byte-for-byte and checksum strategies come out even?  :-)
> 
> It's important to remember svn:keywords and svn:eol-style when
> discussing working files and text bases.  When we do a byte-for-byte
> comparison with those keywords set then first the entire working file
> is read, it gets "detranslated" and a new, temporary, file in
> repository format is written.  It's that temporary file that takes
> part in the byte-for-byte comparison, and it's quite possible that the
> write is the main performance hit.
> 
> A possible optimisation would be to store a second md5sum for the
> working file, and then do an md5sum comparison by reading the working
> file and thus avoid the detranslate/write altogether.

Just a note: Even without a second md5sum, you can avoid at least the
write. In theory, you can do the detranslation and md5sum in chunks in
memory without ever writing anything to disk. Like, having the
detranslation going into a (non-disk) stream and doing the md5sum over
that stream and then letting that stream go to /dev/null or such. That
was the theory. Don't know how difficult it is to implement that in
Subversion with the existing infrastructure.

> A possible problem is that the user may have edited an expanded
> keyword causing the md5sum comparison to indicate a modification,
> whereas the detranslate would drop that edit and indicate no
> modification.

IMHO, that's a separate issue. The first is how to recognize changes
reliably (and fast), the second is how to force submits even when
there is no actual change (with regard of what would be commited).

To me, changing a generated keyword part feels the same as changing
the timestamp. It indicates that there was some kind of change (and
that a check of the content may be due), but that may have been
effectively an no-op.

Whether we want to support a "no-change" commit and how it would be
triggered (a flag, some keyword change as suggested above, etc.) is a
different question and shouldn't influence how we want to detect real
changes. (If we get it cheaply, fine, but it shouldn't limit us.)

Bye,

	Benjamin.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: File modification detection?

Posted by Philip Martin <ph...@codematters.co.uk>.

Ben Collins-Sussman <su...@collab.net> writes:

> On Thu, 2004-04-22 at 14:28, Mike Mason wrote:
>
>> If Subversion is storing MD5s for the text bases anyhow, shouldn't the 
>> comparison use the MD5 instead of being byte-for-byte?
>
> Hm, but that would force us to always read the *entire* text-base file. 
> At the moment, we bail out as soon as we see a difference between the
> working and text-base files.
>
> My first thought was:  the current algorithm is faster than
> checksumming, because we can bail early.  Most changes aren't at the
> very end of a file.
>
> But then Karl pointed out that while "on average" our current algorithm
> bails after reading half the text-base, this is cancelled out by the
> fact that it's reading two files instead of one.  So maybe the
> byte-for-byte and checksum strategies come out even?  :-)

It's important to remember svn:keywords and svn:eol-style when
discussing working files and text bases.  When we do a byte-for-byte
comparison with those keywords set then first the entire working file
is read, it gets "detranslated" and a new, temporary, file in
repository format is written.  It's that temporary file that takes
part in the byte-for-byte comparison, and it's quite possible that the
write is the main performance hit.

A possible optimisation would be to store a second md5sum for the
working file, and then do an md5sum comparison by reading the working
file and thus avoid the detranslate/write altogether.  A possible
problem is that the user may have edited an expanded keyword causing
the md5sum comparison to indicate a modification, whereas the
detranslate would drop that edit and indicate no modification.

-- 
Philip Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: File modification detection?

Posted by Mike Mason <mg...@thoughtworks.net>.

Ben Collins-Sussman wrote:

>On Thu, 2004-04-22 at 14:28, Mike Mason wrote:
>
>  
>
>>If Subversion is storing MD5s for the text bases anyhow, shouldn't the 
>>comparison use the MD5 instead of being byte-for-byte?
>>    
>>
>
>Hm, but that would force us to always read the *entire* text-base file. 
>At the moment, we bail out as soon as we see a difference between the
>working and text-base files.
>
>My first thought was:  the current algorithm is faster than
>checksumming, because we can bail early.  Most changes aren't at the
>very end of a file.
>
>But then Karl pointed out that while "on average" our current algorithm
>bails after reading half the text-base, this is cancelled out by the
>fact that it's reading two files instead of one.  So maybe the
>byte-for-byte and checksum strategies come out even?  :-)
>

Well, I figure most of the time my working files are going to be cached 
by the operating system[1] because I've been working on them. CPUs are 
pretty fast so it's disk IO we're worried about here -- does that make a 
difference? As someone already pointed out I guess this isn't really 
that important unless you're storing big files that change content but 
not size (like, er, maybe a disk image).

Mike.

[1] For useful values of "operating system" -- Windows XP seems to like 
having 300 megs of free ram on my 1 gig machine and then endlessly 
grinding the disk for me.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: File modification detection?

Posted by Ben Collins-Sussman <su...@collab.net>.

On Thu, 2004-04-22 at 14:28, Mike Mason wrote:

> If Subversion is storing MD5s for the text bases anyhow, shouldn't the 
> comparison use the MD5 instead of being byte-for-byte?

Hm, but that would force us to always read the *entire* text-base file. 
At the moment, we bail out as soon as we see a difference between the
working and text-base files.

My first thought was:  the current algorithm is faster than
checksumming, because we can bail early.  Most changes aren't at the
very end of a file.

But then Karl pointed out that while "on average" our current algorithm
bails after reading half the text-base, this is cancelled out by the
fact that it's reading two files instead of one.  So maybe the
byte-for-byte and checksum strategies come out even?  :-)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: File modification detection?

Posted by Mike Mason <mg...@thoughtworks.net>.

Ben Collins-Sussman wrote:

> Andy Whitcroft wrote:
>
>> It appears that the client file modification detection is incorrectly 
>> checking modification time and assuming that if that is unchanged the 
>> file is unchanged even in the face of the size/inode number changing.
>
>
> That is exactly what's happening.  SVN uses the same algorithm as CVS 
> to detect if a file has changed:
>
>   if (file's timestamp == one recorded in .svn/entries)
>      return NOT_CHANGED
>   else if (size of file != size of .svn/text-base/file)
>      return CHANGED
>   else
>      compare file and .svn/text-base/file byte-by-byte.
>
> So what you're doing is 'mv'ing a different file with identical 
> timestamp on top of the file, and the algorithm is falling down.
>
> Developers: is this a legitimate hole in our algorithm?  Should we 
> change it to cover this edge-case?


As a datapoint, StarTeam, *spit*, uses timestamps as the default 
mechanism for detecting modifications, but has the option of using 
md5sums instead. With md5 turned on it runs a lot slower, but sometimes 
you do need it (mainly to work around other brokenness in StarTeam).

If Subversion is storing MD5s for the text bases anyhow, shouldn't the 
comparison use the MD5 instead of being byte-for-byte?

Cheers,
Mike.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: File modification detection?

Posted by Ben Collins-Sussman <su...@collab.net>.

Andy Whitcroft wrote:

> It appears that the client file modification detection is incorrectly 
> checking modification time and assuming that if that is unchanged the 
> file is unchanged even in the face of the size/inode number changing.

That is exactly what's happening.  SVN uses the same algorithm as CVS to 
detect if a file has changed:

   if (file's timestamp == one recorded in .svn/entries)
      return NOT_CHANGED
   else if (size of file != size of .svn/text-base/file)
      return CHANGED
   else
      compare file and .svn/text-base/file byte-by-byte.

So what you're doing is 'mv'ing a different file with identical 
timestamp on top of the file, and the algorithm is falling down.

Developers: is this a legitimate hole in our algorithm?  Should we 
change it to cover this edge-case?

I'm worried that the only way to prevent this is to always-stat the 
text-base (which will slow down things even further, since we're doing 
two file stats instead of one), or to record the text-base's size in our 
entries file.

Still, though, I could imagine somebody 'moving' a file with identical 
timestamp *and* identical size on top of the working file, and still 
fooling the algorithm.  So is this road even worth pursuing?  Is there 
some nice philosophical quip that allows us to brush off this whole 
scenario?  Something like, "don't fool with timestamps, kid".

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org