You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by John Szakmeister <jo...@szakmeister.net> on 2005/11/20 20:10:33 UTC

[PATCH] Extra integrity test for svnadmin dump/verify (was Re: svnadmin dump/verify not working as expected?)

On Tuesday 15 November 2005 09:18, kfogel@collab.net wrote:
[snip]
> That sounds sensible to me, if we're going to call this "verification" :-).
> Thanks for digging into the code to figure out what was happening.

I put together a patch that does an extra check to make sure the actual file 
length matches the length stored in the repository.  It doesn't come without 
a consequence though.  In this case, it means that if the length of the 
svndiff was small for a file, that you can no longer dump that revision in 
your repository (it would error out on both svnadmin verify and svnadmin 
dump).  I'm not sure how I feel about that.  One the one hand, you don't 
detect the potential corruption until you try to load from the dump.  On the 
other hand, without this fix, you get a dump file that can be modified and 
adjusted to match reality, even though it won't load without modification.

Any opinions?  Perhaps I should look at introducing another repos API 
specifically for verification?

I've attached the patch and log message for your viewing pleasure. :-)

-John

[[[
Add an extra integrity check to 'svnadmin dump' and 'svnadmin verify' by
ensuring the actual length of a file in the repository matches the expected
length.

* subversion/libsvn_repos/dump.c
  (dump_full_contents): New helper function.

  (dump_node): Use the new dump_full_contents() helper function when we're
  not dumping deltas.
]]]


Re: [PATCH] Extra integrity test for svnadmin dump/verify (was Re: svnadmin dump/verify not working as expected?)

Posted by John Szakmeister <jo...@szakmeister.net>.
On Sunday 20 November 2005 15:10, John Szakmeister wrote:
[snip]
> Any opinions?  Perhaps I should look at introducing another repos API
> specifically for verification?

I just realized I didn't really offer my own opinion, but I'm of the believe 
of finding problems sooner rather than later (we run svnadmin verify every 
night, but we don't currently attempt to reload empty repositories... we've 
got way too much data for that).  So I'd like 'svnadmin dump/verify' kick out 
an error when it runs into an issue (like it does for checksum mismatches).

If you guys don't mind, I'll go ahead and commit the change.

-John

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: FSFS breakages

Posted by John Szakmeister <jo...@szakmeister.net>.
On Sunday 27 November 2005 18:11, Malcolm Rowe wrote:
> On Tue, Nov 22, 2005 at 06:33:34PM -0500, John Szakmeister wrote:
> > [lots of words about a bug in FSFS]
>
> Ok, there's quite a lot of information there.
>
> I've a couple of thoughts, some of which are probably obvious:
>
> * The fact that a lot of the problems were on Subversion 1.2.1, Redhat,
> and mod_dav_svn may be more reflective of the usage of those things in
> general, rather than being an indicator of any possible correlation with
> this problem.
>  - Just to eliminate the possibility, have you any idea whether the
>  RedHat/Fedora Subversion packages contain any local patches?

I have no idea... I don't use either one of those distributions, and I haven't 
had any time to go and chase down what they're doing. :-(

> * I'm not sure I fully understand the problem yet, but I can't see how
> failing hardware could necessarily have caused us to fail in the way I
> think you're indicating.

/me nods.  I absolutely agree, which is why I think we have an issue in the 
FSFS code. :-)

> * There don't appear to have been any major changes to FSFS since 1.2.0,
> so this bug, if it is in Subversion itself, probably still exists.

There was only one that I can remember, and that was a dir-caching bug.  IIRC, 
it only applied to directories that were deleted, so I can't really see how 
that would have anything to do with this particular situation.

>
> * You mentioned that you managed to get hold of the repository in some
> cases.  Did you try re-running the transaction to see if the problem
> was reproducible or not? (in the cases where you were able to restore
> the file's contents, naturally).

I only had the repository, not the original data.  In fact, in *every* case, 
they found out they had an issue far after the fact.  I crept up when the 
skip delta algorithm finally got around to using the revision containing the 
corrupted file. :-(  So no one had the original data around that caused the 
problem.  Otherwise, I would have definitely tried to replay the transaction.

> * You mentioned that it was only delta representations for binary files
> that were affected.  Were they particularly large?  Were the original
> files compressed?  (If they were compressed, the self-compressed deltas
> would tend to be close to the size of the file, I suspect - and depending
> upon the change, regular deltas might be as well)

Lots of questions there.  No, not all of the files were particularly large.  
In one case, then entire commit was about 2.5MB, which consisted solely of 
the single file that was self compressed and corrupted (it was a MS Word 
file).  In another case, one of the corrupted commits was 36MB in size.  The 
corrupted file was supposed to have a delta running 2.6MB in length, and the 
expanded length of the file was more on the order of 4.6MB.  Again, it was 
self compressed.  In another revision of the same repository, the commit was 
only 27MB in size.  This time it the rep was deltified against a previous 
version.  The affected file was supposed to have an svndiff running about 
2.7MB in length and have an expanded size of 27.5MB.

Actually, the delta algorithms we've been using are binary delta algorithms, 
so they actually do pretty decent if the file wasn't compressed to begin 
with.

>  - I wonder if the fact that it was binary files was just due to the
>  fact that, proportionally, they'd be more likely to take up most of
>  the space in a rev file.

Could be.  But I've got nothing to suggest that it's solely triggered by 
binary files, so I'm keeping an open mind about the problem. :-)  But that 
was the trend that I saw.

> To the specific problem:
> >  * In every case there was an extra block of data present in the svndiff.
> >  In one case, it appeared that the extra data was actually a repeat of
> > block elsewhere in the stream.
> >  * In every case the actual svndiff contents were fine (there were no bad
> >    instructions).  The windows themselves seemed to be complete.
> >  * In every case, all other offsets within the file pointed exactly where
> > they should (meaning that somehow the data was there when we wrote the
> > revision out).
> >  * In one case, I was actually able to recover the contents of the file
> >    completely (the very start of the svndiff stream was there).
>
> I'm not _quite_ sure I get exactly what the problem was.  When you say
> that there was an extra block of data in the svndiff, do you mean in the
> svndiff itself, or in the representation in the rev file?  In other words,
> was it the DELTA-ENDREP that was corrupt (containing a valid svndiff
> and something else), or was it the svndiff itself that was corrupt
> (containing garbage after the used 'new data', or similar).
>
> Where was the extra block of data?  Before, after, or inside the correct
> data?  Did it overwrite any valid data, or was it just 'extra'? (I
> guess if you couldn't recover the representation in all the cases,
> it was an overwrite?)

Good questions.  They're actually really hard to answer though.  There was 
only one instance in which the entire svndiff was found.  In this case, the 
extra block of data appeared before the real svndiff.  In the other cases, 
there were more windows than expected given the size of the delta in the text 
rep field.  As for whether it overwrote any data, I can't say. :-(  I also 
can't speak at to whether the extra block came before, after, or in the 
middle of the svndiff window.  What I can say is that the extra data all 
seemed to end on proper window boundaries.

> In one case, the extra data was a repeat (of what kind of data?).
> What was it in the other cases?

The case where there was repeated data, it was one of the sections of a DLL 
that was repeated (debugging symbol section IIRC).  I actually have the file 
around still, but I'm too tired and too lazy to look right now. :-)

In the other cases, I didn't really find a correlation with any of the data 
that was a part of the file, or the commit.  It seemed... more random.

> In addition to the extra data, what other problems did you see?  I think
> you mentioned that the node-rev had a 'text' <offset> that pointed in
> the wrong place?  What did it point to, if anything?

I've been tying both problem together.  In every case where this extra block 
of data was present, the text rep pointed past what appeared to be the start 
of the svndiff.  Often time it pointed to some place inside of one of the 
svndiff windows.  I'm fairly certain the amount of displacement was directly 
proportional to the size of the extra block, but I haven't had time nor 
sufficient access to broken repositories to prove that theory.

> I don't understand 'all other offsets pointed where they should' -
> what other offsets are you referring to? (that would indicate that the
> data was written correctly originally).

What I meant was that every other node rep in the file had valid offsets for 
their text and prop reps.  All this really means is that this extra block of 
data was present during the write.  Otherwise, we couldn't account for the 
extra svndiff data that's present (all the other offset would have been 
written as if thee data wasn't there).

>
> I've spent a while tonight taking a look at the FSFS write process --
> and it looks pretty straightforward.  Particularly, I can't see how
> 'DELTA' can be immediately followed by anything other than the start of
> an svndiff stream, nor how the offsets in 'text:' can be anything other
> than to what we've already written.

Yep, I've been through it several times myself.  I don't see it either. :-(

> The only thing I can really think of is that we're either corrupting one
> of the structures in memory - a pool lifetime issue, maybe? - or that
> we're corrupting the file when we rewrite it, at transaction commit time.
> Neither looks particularly likely.

I agree.  The problem right now is reproducibility.  I've tried everything I 
can think of.  I've even run the stress test for days with no avail.

> > I have notes if you want to see them. :-)
>
> Ok, yes.  It might be a waste of time (we might not be able to work out
> what broke), but then again, it might not.

There's probably still some stuff that's only in my head, but I'll give you 
what I have.  I also have some logs from a tool that I wrote to help fix 
these issues when I didn't have access to the repository.  It basically dumps 
all the data structures in the rev file.  You should be able to at least 
"see" which file was affected.  Unfortunately, I was only allowed to keep a 
corrupted rev file from one person.  I had to remove the remove another guys 
repository from my machine after I fixed it. :-(  That's actually been a real 
problem in trying to diagnose this issue.  I only had access to the rev file 
in one case.  I got access to the entire repository in another.  And I got 
nothing in the remaining two.  I wrote my little tool to help locate the 
issue and walked them through the repair process via email. :-(

I'm out of steam tonight, but I'll roll everything I can up and post it 
somewhere you can get to.  Expect to see a private mail with that URL some 
time tomorrow morning.  It'll be good to have another set of eyes looking at 
this.

-John

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: FSFS breakages

Posted by Malcolm Rowe <ma...@farside.org.uk>.
On Tue, Nov 22, 2005 at 06:33:34PM -0500, John Szakmeister wrote:
> [lots of words about a bug in FSFS]

Ok, there's quite a lot of information there.

I've a couple of thoughts, some of which are probably obvious:

* The fact that a lot of the problems were on Subversion 1.2.1, Redhat,
and mod_dav_svn may be more reflective of the usage of those things in
general, rather than being an indicator of any possible correlation with
this problem.
 - Just to eliminate the possibility, have you any idea whether the
 RedHat/Fedora Subversion packages contain any local patches?

* I'm not sure I fully understand the problem yet, but I can't see how
failing hardware could necessarily have caused us to fail in the way I
think you're indicating.

* There don't appear to have been any major changes to FSFS since 1.2.0,
so this bug, if it is in Subversion itself, probably still exists.


* You mentioned that you managed to get hold of the repository in some
cases.  Did you try re-running the transaction to see if the problem
was reproducible or not? (in the cases where you were able to restore
the file's contents, naturally).

* You mentioned that it was only delta representations for binary files
that were affected.  Were they particularly large?  Were the original
files compressed?  (If they were compressed, the self-compressed deltas
would tend to be close to the size of the file, I suspect - and depending
upon the change, regular deltas might be as well)
 - I wonder if the fact that it was binary files was just due to the
 fact that, proportionally, they'd be more likely to take up most of
 the space in a rev file.

To the specific problem:
>  * In every case there was an extra block of data present in the svndiff.  In
>    one case, it appeared that the extra data was actually a repeat of block
>    elsewhere in the stream.
>  * In every case the actual svndiff contents were fine (there were no bad
>    instructions).  The windows themselves seemed to be complete.
>  * In every case, all other offsets within the file pointed exactly where they
>    should (meaning that somehow the data was there when we wrote the revision
>    out).
>  * In one case, I was actually able to recover the contents of the file
>    completely (the very start of the svndiff stream was there).
> 

I'm not _quite_ sure I get exactly what the problem was.  When you say
that there was an extra block of data in the svndiff, do you mean in the
svndiff itself, or in the representation in the rev file?  In other words,
was it the DELTA-ENDREP that was corrupt (containing a valid svndiff
and something else), or was it the svndiff itself that was corrupt
(containing garbage after the used 'new data', or similar).

Where was the extra block of data?  Before, after, or inside the correct
data?  Did it overwrite any valid data, or was it just 'extra'? (I
guess if you couldn't recover the representation in all the cases,
it was an overwrite?)

In one case, the extra data was a repeat (of what kind of data?).
What was it in the other cases?

In addition to the extra data, what other problems did you see?  I think
you mentioned that the node-rev had a 'text' <offset> that pointed in
the wrong place?  What did it point to, if anything?

I don't understand 'all other offsets pointed where they should' -
what other offsets are you referring to? (that would indicate that the
data was written correctly originally).


I've spent a while tonight taking a look at the FSFS write process --
and it looks pretty straightforward.  Particularly, I can't see how
'DELTA' can be immediately followed by anything other than the start of
an svndiff stream, nor how the offsets in 'text:' can be anything other
than to what we've already written.

The only thing I can really think of is that we're either corrupting one
of the structures in memory - a pool lifetime issue, maybe? - or that
we're corrupting the file when we rewrite it, at transaction commit time.
Neither looks particularly likely.

> I have notes if you want to see them. :-)
> 

Ok, yes.  It might be a waste of time (we might not be able to work out
what broke), but then again, it might not.

Regards,
Malcolm

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: FSFS breakages

Posted by John Szakmeister <jo...@szakmeister.net>.
On Tuesday 22 November 2005 11:19, Malcolm Rowe wrote:
> On Tue, Nov 22, 2005 at 10:18:36AM -0500, John Szakmeister wrote:
> > man-made mistake.  In the wild, the FSFS corruptions that I have seen
> > have had broken svndiffs, or the text rep was pointing to the wrong
> > location, or a mix of both.
>
> Out of interest, do we have any idea why these problems are occurring?
> (Particular filesystems, distributions, versions of Subversion?).
> Is it something that a broken client can cause?

It's not clear whether a broken client can induce the problem or not.  I've 
helped 3 people on the users list recover their repository, 1 other fixed it 
himself, and one other mentioned a similar issue but didn't have the old 
broken one around any more to examine.  In one case I got access to the 
entire repository.  In another case, I was only given access to the broken 
revision.  In the 3rd case, we used my tool to locate the problem, and the 
person patched it by hand.  In the 4 cases, I was able to inquire about their 
setup, and I also obtained logs of information about the broken revision 
using my tool (it basically pulls out the various metadata, and decodes the 
svndiff window, but it doesn't have any actual data in the output so it made 
people more comfortable when having to deal with proprietary data).  Here's 
what I've seen so far:
 * At least 4 instances involved Subversion 1.2.1 on the server-side
 * Most were using some form of Redhat or Fedora Core
 * All were using mod_dav_svn in the backend
 * In every case it was a delta being stored (whether it be self-compressed
   or against another revision).
 * In every case the text rep offset was pointing after the expected start
   of the representation.
 * In every case there was an extra block of data present in the svndiff.  In
   one case, it appeared that the extra data was actually a repeat of block
   elsewhere in the stream.
 * In every case the actual svndiff contents were fine (there were no bad
   instructions).  The windows themselves seemed to be complete.
 * In every case, there was only one file that exhibited this corruption, all
   other files were fine.
 * In every case, all other offsets within the file pointed exactly where they
   should (meaning that somehow the data was there when we wrote the revision
   out).
 * In every case, it was a file that was affected.  Moreover, it was a binary
   file that was affected.
 * In one case, I was actually able to recover the contents of the file
   completely (the very start of the svndiff stream was there).

> One of FSFS's benefits is its write-only-ness.  That should make it harder
> for corruption to occur, but it doesn't help if we're we're writing a
> corrupt revision in the first place!

I think you mean "read-only-ness", but I definitely agree.  I think there is 
something subtly wrong, but I haven't been able to turn up the culprit.  
Having the extra data block but all the offset turn out alright implies (to 
me anyways) that we did something wrong.

I will say that in 2 cases, they discovered failing hardware (the hard drive 
in both cases).  However, I'd expect the corruption to be more devastating, 
if that was truly the root cause of the problem.  Also, most everyone only 
had one such corruption, except for the last guy that I helped who had 3 such 
problems.

I have notes if you want to see them. :-)

-John

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: FSFS breakages

Posted by Malcolm Rowe <ma...@farside.org.uk>.
On Tue, Nov 22, 2005 at 10:18:36AM -0500, John Szakmeister wrote:
> man-made mistake.  In the wild, the FSFS corruptions that I have seen
> have had broken svndiffs, or the text rep was pointing to the wrong
> location, or a mix of both.

Out of interest, do we have any idea why these problems are occurring?
(Particular filesystems, distributions, versions of Subversion?).
Is it something that a broken client can cause?

One of FSFS's benefits is its write-only-ness.  That should make it harder
for corruption to occur, but it doesn't help if we're we're writing a
corrupt revision in the first place!

Regards,
Malcolm

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Extra integrity test for svnadmin dump/verify (was Re: svnadmin dump/verify not working as expected?)

Posted by John Szakmeister <jo...@szakmeister.net>.
David James wrote:
> On 11/22/05, John Szakmeister <jo...@szakmeister.net> wrote:
>> On Tuesday 22 November 2005 04:54, Malcolm Rowe wrote:
>>> On Sun, Nov 20, 2005 at 03:10:33PM -0500, John Szakmeister wrote:
>>>> a consequence though.  In this case, it means that if the length of the
>>>> svndiff was small for a file, that you can no longer dump that revision
>>>> in your repository (it would error out on both svnadmin verify and
>>>> svnadmin dump).
>>> If someone comes to svn-dev with a repository that 'works', but fails
>>> verification due to this problem, how do we repair it, given that 'dump'
>>> no longer works?
>> The same we fix other issues: edit the backend data structures.  Some of us
>> have done this on the users@ list a number of times. :-)
>>
>>> I like the idea of flagging potential problems as soon as possible, but
>>> not the idea of blocking 'svnadmin dump' when we don't really need to.
>>> Is there any way we can convert this hard error into a warning?  (Yes,
>>> I realise that might be difficult, given that it's code in libsvn_repos,
>>> not svnadmin itself).
>> I'm not a fan of it either, but we already have a precedent for it.  Change a
>> byte in the plain text representation of a file, and you'll run into the same
>> issue.  *shrug*
> Can we add a "--no-verify" flag or similar so that users can output
> broken dumps for debugging purposes?

Possibly.  The problem is that if you really want everything (not just
to skip the extra check this patch adding), we'd have to change a number
of things.  For instance, right now we get a stream pointer to the file
contents when doing the dump.  It's actually the stream implementation
that's doing the MD5 checksum.  We'd have to expose another API, or
provide a flag to the get file contents function in order to choose an
alternate stream implementation that doesn't include the checksumming.
I think there are several other similar issues too.

I'm also not sure that being able to produce a broken dump file is more
helpful than editing the backend directly.  At least by editing the
backend directly you stand a chance at recovering everything, and you
get a glimpse and the real corruption.  With the dump, we're just going
to dump what we think is there, and anything ancillary to that is lost.

I've actually been working on a tool to help repair/fix broken FSFS
repositories.  Perhaps I should just include this extra check there, and
leave well enough alone. :-)  To be honest, this is protecting against a
man-made mistake.  In the wild, the FSFS corruptions that I have seen
have had broken svndiffs, or the text rep was pointing to the wrong
location, or a mix of both.  I was just surprised to see that when I
truncated the file, that the MD5 was no longer being compared.  I didn't
mean to open a can of worms. :-)

-John


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Extra integrity test for svnadmin dump/verify (was Re: svnadmin dump/verify not working as expected?)

Posted by David James <ja...@cs.toronto.edu>.
On 11/22/05, John Szakmeister <jo...@szakmeister.net> wrote:
> On Tuesday 22 November 2005 04:54, Malcolm Rowe wrote:
> > On Sun, Nov 20, 2005 at 03:10:33PM -0500, John Szakmeister wrote:
> > > a consequence though.  In this case, it means that if the length of the
> > > svndiff was small for a file, that you can no longer dump that revision
> > > in your repository (it would error out on both svnadmin verify and
> > > svnadmin dump).
> >
> > If someone comes to svn-dev with a repository that 'works', but fails
> > verification due to this problem, how do we repair it, given that 'dump'
> > no longer works?
>
> The same we fix other issues: edit the backend data structures.  Some of us
> have done this on the users@ list a number of times. :-)
>
> > I like the idea of flagging potential problems as soon as possible, but
> > not the idea of blocking 'svnadmin dump' when we don't really need to.
> > Is there any way we can convert this hard error into a warning?  (Yes,
> > I realise that might be difficult, given that it's code in libsvn_repos,
> > not svnadmin itself).
>
> I'm not a fan of it either, but we already have a precedent for it.  Change a
> byte in the plain text representation of a file, and you'll run into the same
> issue.  *shrug*
Can we add a "--no-verify" flag or similar so that users can output
broken dumps for debugging purposes?

Cheers,

David

--
David James -- http://www.cs.toronto.edu/~james

Re: [PATCH] Extra integrity test for svnadmin dump/verify (was Re: svnadmin dump/verify not working as expected?)

Posted by John Szakmeister <jo...@szakmeister.net>.
On Tuesday 22 November 2005 04:54, Malcolm Rowe wrote:
> On Sun, Nov 20, 2005 at 03:10:33PM -0500, John Szakmeister wrote:
> > a consequence though.  In this case, it means that if the length of the
> > svndiff was small for a file, that you can no longer dump that revision
> > in your repository (it would error out on both svnadmin verify and
> > svnadmin dump).
>
> If someone comes to svn-dev with a repository that 'works', but fails
> verification due to this problem, how do we repair it, given that 'dump'
> no longer works?

The same we fix other issues: edit the backend data structures.  Some of us 
have done this on the users@ list a number of times. :-)

> I like the idea of flagging potential problems as soon as possible, but
> not the idea of blocking 'svnadmin dump' when we don't really need to.
> Is there any way we can convert this hard error into a warning?  (Yes,
> I realise that might be difficult, given that it's code in libsvn_repos,
> not svnadmin itself).

I'm not a fan of it either, but we already have a precedent for it.  Change a 
byte in the plain text representation of a file, and you'll run into the same 
issue.  *shrug*

-John

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: [PATCH] Extra integrity test for svnadmin dump/verify (was Re: svnadmin dump/verify not working as expected?)

Posted by Malcolm Rowe <ma...@farside.org.uk>.
On Sun, Nov 20, 2005 at 03:10:33PM -0500, John Szakmeister wrote:
> a consequence though.  In this case, it means that if the length of the 
> svndiff was small for a file, that you can no longer dump that revision in 
> your repository (it would error out on both svnadmin verify and svnadmin 
> dump).

If someone comes to svn-dev with a repository that 'works', but fails
verification due to this problem, how do we repair it, given that 'dump'
no longer works?

I like the idea of flagging potential problems as soon as possible, but
not the idea of blocking 'svnadmin dump' when we don't really need to.
Is there any way we can convert this hard error into a warning?  (Yes,
I realise that might be difficult, given that it's code in libsvn_repos,
not svnadmin itself).

Regards,
Malcolm

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org