Posted to dev@subversion.apache.org by Benjamin Pflugmann <be...@pflugmann.de> on 2003/02/01 23:17:28 UTC

checksums, another data point

Hi.

After 2 weeks of not having much time, I tried to update my
working copy of the svn tree, and got the dreaded checksum error.

The interesting part is that I got exactly the same as Garrett
reported earlier here:

  http://subversion.tigris.org/servlets/ReadMsg?list=dev&msgNo=30230

I read all the checksum threads in the archive again to be sure
that I did not miss something the first time, but it seems that the
thread above simply died away?

Anyhow, as I said, I have exactly the same problem (win-tests.py),
same checksums (expected: f9c9027cfe19f92f7f6e1b9d3d316acf, actual:
9f30bad101c64a873b0f2c05d529924c) and so on.

I last updated my tree on the 18th to r4420. Unfortunately, I can no
longer say for sure which version I updated from. I recompiled
subversion that day, so I am currently using r4420, and that is the
version that ran into the problem. For updating to r4420 I used a
binary from r4250.

As the mentioned thread says, both of my files, the text-base and the
working copy, have UNIX line endings and match the "actual" checksum;
when I replace \n by \r\n, I get the expected checksum.
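
The line-ending effect described above is easy to reproduce in
isolation. The following is a minimal sketch, not Subversion code: the
file content and the helper names md5_hex and to_crlf are made up for
illustration. It only shows that the same text hashes differently with
LF and with CRLF line endings.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a hex string."""
    return hashlib.md5(data).hexdigest()


def to_crlf(data: bytes) -> bytes:
    """Convert line endings to CRLF.

    Existing CRLF pairs are normalized to LF first so they are not doubled.
    """
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


# Made-up stand-in for a text file checked out with UNIX line endings.
unix_text = b"import os\nprint(os.name)\n"
dos_text = to_crlf(unix_text)

print("LF:  ", md5_hex(unix_text))
print("CRLF:", md5_hex(dos_text))
```

If a checksum recorded for one form is compared against the other form,
the mismatch is reported even though no byte of the visible text changed.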

As shown in the post from David, both checksums can actually be found
in the repository:

$ svn cat http://svn.collab.net/repos/svn/trunk/win-tests.py -r 4371 | openssl md5
9f30bad101c64a873b0f2c05d529924c
$ svn cat http://svn.collab.net/repos/svn/trunk/win-tests.py -r 4105 | openssl md5
f9c9027cfe19f92f7f6e1b9d3d316acf

The changelog shows (which apparently nobody mentioned until now):

$ svn log -r 4371 .
------------------------------------------------------------------------
rev 4371:  rassilon | 2003-01-13 23:43:13 +0100 (Mon, 13 Jan 2003) | 3 lines

* Just about everything else that wasn't test output, a binary file,
or already to set to CRLF: Set svn:eol-style to native.

------------------------------------------------------------------------

So it seems that when I updated to r4420 (with the binary compiled
from r4250), svn failed to update the checksum correctly.


Oh! My! God!

After trying for more than half an hour to reproduce the behaviour
using the older binaries I still have lying around, I just noticed the
filename properly for the first time: win-tests.py. Sounds familiar?
Perhaps from

  http://subversion.tigris.org/servlets/ReadMsg?list=dev&msgId=219941
  ("READ ME: working copy ickiness!" from Ben)

It seems that changing text-bases isn't a good idea when md5sums are
used in parallel. :-)


Although I have to admit I still get a headache when I try to follow
which line ending the text-base and the working copy each had at which
point, and why, I am pretty sure I have found the source of the problem
in this case.

And the solution is: don't mess around with the text-base, even when
instructed to do so. ;-) Or, well, alternatively, maybe update the
checksum, too.

And does 

  http://subversion.tigris.org/servlets/ReadMsg?list=dev&msgNo=30132
  ("win-tests.py broken in the repository?")

have influence on this (I think not, because the sums should be
correct for whatever is stored, but maybe the eol-conversion gets in
between)? Or was it all about this last issue and I just embarrassed
myself? ;)

HTH,

	Benjamin.

Re: checksums, another data point

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

Benjamin Pflugmann <be...@pflugmann.de> writes:
> And the solution is: don't mess around with the text-base, even when
> instructed to do so. ;-) Or, well, alternatively, maybe update the
> checksum, too.
> 
> And does 
> 
>   http://subversion.tigris.org/servlets/ReadMsg?list=dev&msgNo=30132
>   ("win-tests.py broken in the repository?")
> 
> have influence on this (I think not, because the sums should be
> correct for whatever is stored, but maybe the eol-conversion gets in
> between)? Or was it all about this last issue and I just embarrassed
> myself? ;)

Just check out a new working copy and try again :-).  win-tests.py got
seriously messed up, for reasons unrelated to the checksum code (the
checksum code just detects the messed-upedness).

-K

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: why the change in the checksums

Posted by Greg Stein <gs...@lyra.org>.

On Sun, Feb 02, 2003 at 08:20:47PM -0500, mark benedetto king wrote:
> On Sun, Feb 02, 2003 at 08:36:37AM -0800, solo turn wrote:
> > does somebody know why exactly the "checksumming, more checksumming"
> > was introduced?
> > 
> 
> We've seen network data corruption.  Correct IP checksums, incorrect
> data.  It's bound to happen: 16 bits of checksum is just not enough.
> 
> Maybe that's okay for web-surfing, but it's not okay for svn.  Application
> level data integrity checking is a requirement.

Yup. There are any number of avenues for corruption which are totally
outside of our control (bad RAM, bad disk, etc). It does and it will happen.
And Subversion can now detect it.

I think ben understates a key point: a version control system simply CANNOT
lose data. Trust is paramount. Checksums are one way to watch out for (or at
least detect) corruption. One day, Subversion will goof, and we'll be happy
for those checksums.

[ I tell ya, though... we're already well ahead of systems like SourceSafe
  or ClearCase, where all the logic is on the client side; any tweaky thing
  on the client or the network... blam. admins specifically run procedures
  to watch for and deal with corruption in those systems ]

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

Re: why the change in the checksums

Posted by mark benedetto king <mb...@boredom.org>.

On Sun, Feb 02, 2003 at 08:36:37AM -0800, solo turn wrote:
> does somebody know why exactly the "checksumming, more checksumming"
> was introduced?
> 

We've seen network data corruption.  Correct IP checksums, incorrect
data.  It's bound to happen: 16 bits of checksum is just not enough.

Maybe that's okay for web-surfing, but it's not okay for svn.  Application
level data integrity checking is a requirement.
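
To illustrate why a 16-bit checksum is a weak guard, here is a minimal
sketch, assuming the RFC 1071 style ones' complement sum that TCP uses.
The function name internet_checksum and the sample payloads are made up
for illustration. Because the sum does not depend on the order of the
16-bit words, swapping two words goes unnoticed, while an MD5 digest of
the same data changes.

```python
import hashlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold the carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"abcdefgh"
corrupted = b"cdabefgh"  # the first two 16-bit words are swapped

same_16bit = internet_checksum(original) == internet_checksum(corrupted)
same_md5 = hashlib.md5(original).digest() == hashlib.md5(corrupted).digest()
print(same_16bit, same_md5)  # prints: True False
```

Reordered words are only one kind of damage that slips through; the
point is that an application-level digest catches cases the transport
checksum cannot.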

--ben

Re: why the change in the checksums

Posted by Zack Weinberg <za...@codesourcery.com>.

Kevin Pilch-Bisson <ke...@pilch-bisson.net> writes:

> I merely meant to point out that there isn't a single reason why a
> checksum error can occur, so we can't go blindly recovering from
> checksum errors.

Yes.  Each situation where a checksum error can occur needs to be
thought about individually, and a specific recovery strategy designed;
in the absence of that, 'punt to the user' is always safe.

I think that forcing the repository read-only on an indication of
intra-repository data corruption - checksum errors or otherwise - is
something that should be implemented before 1.0, since it will limit
damage caused by bugs or hardware faults.  The other checksum-error
situations are more user convenience issues, so they can wait.

zw

Re: why the change in the checksums

Posted by Kevin Pilch-Bisson <ke...@pilch-bisson.net>.

Zack's eminently feasible automatic steps snipped.

I agree, and did actually think of those solutions to 2 and 3; I didn't
think of the one about setting the repos to read-only in case 1.

I merely meant to point out that there isn't a single reason why a checksum
error can occur, so we can't go blindly recovering from checksum errors.

In the long term I'd love to see something like your proposal happen, although
I can't see anyone having the time to work on it pre-1.0.

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Kevin Pilch-Bisson                    http://www.pilch-bisson.net
     "Historically speaking, the presences of wheels in Unix
     has never precluded their reinvention." - Larry Wall
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Re: why the change in the checksums

Posted by Zack Weinberg <za...@codesourcery.com>.

Kevin Pilch-Bisson <ke...@pilch-bisson.net> writes:

>> and what are the actions on the checksumming failures?
>> 
> That's really hard to say, and is usually at the discretion of the user, and
> is not possible to accurately automate.

I think that's an overstatement.  There are sensible error-recovery
actions for your three examples -- all of them have to involve the
user, but the computer can help.

> 1) Mismatch of fulltext checksum in repository.  Probably means either
> database corruption or disk failure.  Solution: Revert to a backup.
> Automatable: No.

Agree that the actual recovery is not automatable, but automatic
actions can prevent further damage and shorten the outage window.

When such a mismatch is detected, the repository locks down to
read-only access; transactions in progress are aborted with an
unambiguous error message displayed on the user's terminal; and an
e-mail goes to the system administrator reporting the problem.
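
The lockdown steps above can be sketched roughly as follows. This is
only an illustration of the proposal, not anything Subversion
implements: the class RepositoryGuard, its methods, and the notify_admin
callback (standing in for the e-mail) are all hypothetical names.

```python
class RepositoryGuard:
    """Hypothetical wrapper that tracks open transactions for a repository."""

    def __init__(self, notify_admin):
        self.read_only = False
        self.open_txns = set()
        self.notify_admin = notify_admin  # stand-in for sending an e-mail

    def begin(self, txn_id):
        """Start a write transaction unless the repository is locked down."""
        if self.read_only:
            raise PermissionError(
                "repository is read-only after a checksum mismatch")
        self.open_txns.add(txn_id)

    def report_corruption(self, detail):
        """Lock down, abort open transactions, and notify the administrator."""
        self.read_only = True
        aborted = sorted(self.open_txns)
        self.open_txns.clear()
        self.notify_admin(
            f"checksum mismatch: {detail}; aborted transactions: {aborted}")
        return aborted


messages = []
guard = RepositoryGuard(messages.append)
guard.begin("txn-1")
aborted = guard.report_corruption("fulltext of some node, rev 7")
print(aborted, guard.read_only, messages)
```

Reads are not modelled at all here; the sketch only shows that new
writes are refused once a mismatch has been reported.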

> 2) Mismatch of fulltext checksum over the wire.  Probably means a broken
> TCP/IP implementation.  Solution: fix it/switch ISPs.  Automatable: no.

This is likely to be an intermittent fault.  On the server side, roll
back any write operations not yet committed.  On the client side,
display a warning and retry the operation.  If it fails a second time,
give up.  Dump a detailed error log to a file in /tmp and tell the
user its name.
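
A rough client-side sketch of this retry-then-give-up behaviour, under
stated assumptions: fetch is a hypothetical callable that returns the
received bytes, the expected MD5 is already known, and the names
fetch_verified and ChecksumMismatch are invented for the example. The
log goes to the platform's temporary directory rather than a hard-coded
/tmp.

```python
import hashlib
import sys
import tempfile


class ChecksumMismatch(Exception):
    """Raised when every attempt delivered data with the wrong checksum."""


def fetch_verified(fetch, expected_md5, attempts=2):
    """Call fetch() up to `attempts` times until the MD5 matches."""
    failures = []
    for attempt in range(1, attempts + 1):
        data = fetch()
        actual = hashlib.md5(data).hexdigest()
        if actual == expected_md5:
            return data
        failures.append(
            f"attempt {attempt}: expected {expected_md5}, actual {actual}")
        print(f"warning: checksum mismatch on attempt {attempt}",
              file=sys.stderr)
    # All attempts failed: write the details to a log file and tell the
    # user where it is.
    with tempfile.NamedTemporaryFile(
            "w", prefix="svn-checksum-", suffix=".log", delete=False) as log:
        log.write("\n".join(failures) + "\n")
    raise ChecksumMismatch(
        f"giving up after {attempts} attempts; details in {log.name}")


good = b"file contents"
expected = hashlib.md5(good).hexdigest()
replies = iter([b"file c0ntents", good])  # first transfer is corrupted
result = fetch_verified(lambda: next(replies), expected)
print(result == good)
```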

> 3) Mismatch of text-base checksum in working copy.  Probably means
> either 1) a bug in svn's current code (in the short term) or 2) The
> user somehow managed to edit their text-base copy.  Solution: Get a
> new copy of the text base from the repository that is not corrupt.
> Automatable: yes, EXCEPT What if the edits were a weeks worth of
> change to a source file (the person didn't realize that it was the
> text-base version they were editing).  If we replace it
> automatically with the text-base version, the user is going to be
> mighty pissed.  Thus in practice this is not automatable either.

Rename the corrupted text-base file out of the way, tell the user the
new name, and proceed to fetch a new copy.  Pick a naming convention
for renamed corrupt files that facilitates debugging, and allows
saving several copies of the same file (i.e. if the same file gets
corrupted several times, we should keep all the corrupted copies).
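
One possible naming convention, sketched under assumptions: the suffix
".corrupt-<timestamp>-<n>" and the function name quarantine_text_base
are invented here, and the text-base file name in the demo is only a
placeholder. The counter keeps earlier corrupted copies from being
overwritten.

```python
import os
import tempfile
import time


def quarantine_text_base(path: str) -> str:
    """Rename a corrupted file out of the way and return its new name."""
    stamp = time.strftime("%Y%m%dT%H%M%S")
    n = 0
    while True:
        candidate = f"{path}.corrupt-{stamp}-{n}"
        # Not race-free; a sketch only. Pick the first unused name.
        if not os.path.exists(candidate):
            break
        n += 1
    os.rename(path, candidate)
    return candidate


# Demo: the same (placeholder) text-base file gets corrupted twice.
with tempfile.TemporaryDirectory() as wc:
    base = os.path.join(wc, "win-tests.py.svn-base")
    saved = []
    for _ in range(2):
        with open(base, "wb") as f:
            f.write(b"corrupted contents")
        saved.append(quarantine_text_base(base))
    kept = [os.path.exists(name) for name in saved]
    print(saved, kept)
```

After the rename, the client would tell the user the new name and fetch
a fresh text-base from the repository.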

zw

Re: why the change in the checksums

Posted by Kevin Pilch-Bisson <ke...@pilch-bisson.net>.

On Sun, Feb 02, 2003 at 08:36:37AM -0800, solo turn wrote:
> does somebody know why exactly the "checksumming, more checksumming"
> was introduced?

As one of the people who was originally (and still is) very much in favour of
the checksum code, I think I'll post as much of an explanation as I can.

The purpose of a Revision Control System is to maintain an EXACT copy of all
of your versions of files.  If it fails in the EXACT part, then the rest of
the system is irrelevant, because the system is worthless.

So, for each revision of a file, we store the checksum of the file in the
repository.  This allows you to do something like search for database
corruption or disk failures via a cron-job that retrieves each revision of a
file and its checksum, and compares it against the stored checksum.  If there
is a mis-match, then that indicates a problem.
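
As a minimal sketch of such a verification pass: the in-memory list
below is a made-up stand-in for whatever would actually retrieve each
revision's fulltext and stored checksum from the repository, and
find_corrupt_revisions is a hypothetical helper name.

```python
import hashlib


def find_corrupt_revisions(records):
    """Return the revisions whose fulltext no longer matches its stored MD5.

    records is an iterable of (revision, fulltext_bytes, stored_md5_hex).
    """
    return [
        rev
        for rev, fulltext, stored_md5 in records
        if hashlib.md5(fulltext).hexdigest() != stored_md5
    ]


# Made-up sample data: revision 2 was damaged after its checksum was stored.
records = [
    (1, b"first version\n", hashlib.md5(b"first version\n").hexdigest()),
    (2, b"sec0nd version\n", hashlib.md5(b"second version\n").hexdigest()),
    (3, b"third version\n", hashlib.md5(b"third version\n").hexdigest()),
]

bad = find_corrupt_revisions(records)
print(bad)  # prints: [2]
```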

We also send the checksum over the wire during updates/checkouts/commits.
This lets us determine whether or not there was a network error in
transmission.  Don't try and tell me that TCP already does this, because in
practice, it doesn't[1].

Finally, we store a copy of the checksum of the text-base file, so that we can
detect if the text-base copy of something has become corrupt.  This is
important, because we send binary diffs against that text-base during 
network operations, so we can't afford to have it be corrupted.

> and what are the actions on the checksumming failures?

That's really hard to say, and is usually at the discretion of the user, and
is not possible to accurately automate.

Let's take an example for each of the types of checksums.

1) Mismatch of fulltext checksum in repository.  Probably means either
database corruption or disk failure.  Solution: Revert to a backup.
Automatable: No.

2) Mismatch of fulltext checksum over the wire.  Probably means a broken
TCP/IP implementation.  Solution: fix it/switch ISPs.  Automatable: no.

3) Mismatch of text-base checksum in working copy.  Probably means either 1) a
bug in svn's current code (in the short term) or 2) the user somehow managed
to edit their text-base copy.  Solution: Get a new copy of the text base from
the repository that is not corrupt.  Automatable: yes, EXCEPT: what if the
edits were a week's worth of changes to a source file (the person didn't
realize that it was the text-base version they were editing)?  If we replace it
automatically with the text-base version, the user is going to be mighty
pissed.  Thus in practice this is not automatable either.

> we are a little stuck currently cause we don't know if we should
> upgrade from 16.0 and what happens then ....

You get to know about these types of failures, instead of your repository
becoming silently corrupted.  Yes in the short term, you may be subjected to
the occasional checksum mismatch caused by a bug in subversion, but I think
that is a small price to pay in the long run.

[1]  I once had an ISP that performed NAT for me.  Having written an
implementation of NAT, I know that if you change the address/port of something
in TCP, you need to recalculate the TCP checksum.  Problem was, they were
re-calculating the checksum and passing the packets along regardless of
whether or not the original checksum was valid.  Thus, I checked out
subversion, and got a corrupted working copy that wouldn't even build.

-- 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Kevin Pilch-Bisson                    http://www.pilch-bisson.net
     "Historically speaking, the presences of wheels in Unix
     has never precluded their reinvention." - Larry Wall
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Re: why the change in the checksums

Posted by Karl Fogel <kf...@newton.ch.collab.net>.

solo turn <so...@yahoo.com> writes:
> does somebody know why exactly the "checksumming, more checksumming"
> was introduced?

Yep -- to detect data corruption.

> and what are the actions on the checksumming failures?

It depends on the circumstances.  If everything's going right, you
shouldn't get a checksum failure.  If something went wrong, then the
action to take depends on what exactly went wrong.

> we are a little stuck currently cause we don't know if we should
> upgrade from 16.0 and what happens then ....

You should upgrade both client and server to 0.17.1, or to HEAD, imho.
If you follow the checksum threads, most of the failures are due to
the win-tests.py debacle (which was a case of checksums correctly
detecting badness, i.e., behaving as designed), and one of them is
still unexplained (Lele Gaifax's), but I can't reproduce it with
0.17.1 or higher; see the thread "Re: checksumming crap".

Bottom line is, upgrade everything.  You want the checksums.  Without
them, you risk undetected data corruption (witness win-tests.py).

I don't expect you to have any problems, but if you do (and we can get
a reproducible recipe), then we'll fix it.

-K

why the change in the checksums

Posted by solo turn <so...@yahoo.com>.

does somebody know why exactly the "checksumming, more checksumming"
was introduced?

and what are the actions on the checksumming failures?

we are a little stuck currently cause we don't know if we should
upgrade from 16.0 and what happens then ....

-s.
