You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Philip Martin <ph...@wandisco.com> on 2012/07/30 12:40:01 UTC

rep-cache sanity check on commit

When the commit process finds a representation in the rep-cache the only
sanity check that happens is that the revision must be less than or
equal to HEAD.  We don't check that the offset is valid:

 echo foo > foo
 svnadmin create repo
 svn import -mm foo file://`pwd`/repo/A/f
 sqlite3 repo/db/rep-cache.db "update rep_cache set offset = 4"
 svn import -mm foo file://`pwd`/repo/A/g

or that the checksum at that offset matches:

 echo foo > foo
 echo bar > bar
 svnadmin create repo
 svn import -mm foo file://`pwd`/repo/A/f
 sqlite3 repo/db/rep-cache.db "update rep_cache set hash='e242ed3bffccdf271b7fbaf34ed72d089537b42f'"
 svn import -mm bar file://`pwd`/repo/A/g

In both cases corruption in the rep-cache leads to corruption in the
revision files but that corruption is not detected by commit process
even though subsequent checkouts will fail.

Should we do more sanity checking?  We are using rep-cache to discard
data supplied by the client on the basis that it is already present in
the repository.  Should we check that the offset really is a representation
with the expected checksum?

A side issue: "svnadmin verify" doesn't detect the corruption in the
second case even though checkout will fail.

-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download

Re: rep-cache sanity check on commit

Posted by Philip Martin <ph...@wandisco.com>.
Stefan Fuhrmann <st...@wandisco.com> writes:

> On Tue, Jul 31, 2012 at 11:26 AM, Philip Martin
>
>> We do not need to verify the whole revision or the whole
>> rep-cache.
>>
>
> I only thought that it is hard to verify that a given offset
> range actually *is* a representation. To verify that, there
> must be some noderev pointing to it. Thus, we had to
> enumerate all noderevs until we found a suitable one.

I see what you mean.  If we take an offset from the rep-cache and look
at the data at that offset it may look like a valid representation even
if it is just arbitrary data within another representation.

-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download

Re: rep-cache sanity check on commit

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Wed, Aug 1, 2012 at 1:17 PM, Philip Martin <ph...@wandisco.com>wrote:

> Stefan Fuhrmann <st...@wandisco.com> writes:
>
> > On Tue, Jul 31, 2012 at 11:26 AM, Philip Martin
> >
> >> I suppose we might want to
> >> construct the full-text corresponding to the representation to verify
> >> the checksum.
> >
> >
> > Yes. Worst thing that might happen is a crash because
> > we didn't check some buffer size. combined_window_cache
> > will reduce the I/O overhead for frequently checked reps
> > such as file properties.
>
> If this could crash then we want it to crash during the commit rather
> than on subsequent access after an apparently successful commit.
>

+1. But there is a problem: it will (most likely)
crash *every time* the user tries to commit the
triggering content.

So, we should put something like this into our docs:
* if you see repeated crashes in commit, run svnadmin verify
* if that crashes too, delete the rep-cache.db

-- Stefan^2.

-- 
*Join us this October for Subversion Live
2012<http://www.wandisco.com/svn-live-2012>– 2 full days of training,
networking, live demos and more! 25%
off before Aug. 10th with discount code “earlybird.”

*Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download

Re: rep-cache sanity check on commit

Posted by Philip Martin <ph...@wandisco.com>.
Stefan Fuhrmann <st...@wandisco.com> writes:

> On Tue, Jul 31, 2012 at 11:26 AM, Philip Martin
>
>> I suppose we might want to
>> construct the full-text corresponding to the representation to verify
>> the checksum.
>
>
> Yes. Worst thing that might happen is a crash because
> we didn't check some buffer size. combined_window_cache
> will reduce the I/O overhead for frequently checked reps
> such as file properties.

If this could crash then we want it to crash during the commit rather
than on subsequent access after an apparently successful commit.

-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download

Re: rep-cache sanity check on commit

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Tue, Jul 31, 2012 at 11:26 AM, Philip Martin
<ph...@wandisco.com>wrote:

> Stefan Fuhrmann <st...@wandisco.com> writes:
>
> > On Mon, Jul 30, 2012 at 12:40 PM, Philip Martin
> > <ph...@wandisco.com>wrote:
> >
> >> When the commit process finds a representation in the rep-cache the only
> >> sanity check that happens is that the revision must be less than or
> >> equal to HEAD.  We don't check that the offset is valid:
> >>
> >>  echo foo > foo
> >>  svnadmin create repo
> >>  svn import -mm foo file://`pwd`/repo/A/f
> >>  sqlite3 repo/db/rep-cache.db "update rep_cache set offset = 4"
> >>  svn import -mm foo file://`pwd`/repo/A/g
> >>
> >> or that the checksum at that offset matches:
> >>
> >>  echo foo > foo
> >>  echo bar > bar
> >>  svnadmin create repo
> >>  svn import -mm foo file://`pwd`/repo/A/f
> >>  sqlite3 repo/db/rep-cache.db "update rep_cache set
> >> hash='e242ed3bffccdf271b7fbaf34ed72d089537b42f'"
> >>  svn import -mm bar file://`pwd`/repo/A/g
> >>
> >> In both cases corruption in the rep-cache leads to corruption in the
> >> revision files but that corruption is not detected by commit process
> >> even though subsequent checkouts will fail.
> >>
> >
> > Has that kind of corruption been observed in the wild?
>
> I'm not aware of any reports.  I did start looking at this because of a
> repository with an incorrect offset referring to an older revision file
> but that now appears to be a different problem.
>
> >> Should we do more sanity checking?  We are using rep-cache to discard
> >> data supplied by the client on the basis that it is already present in
> >> the repository.  Should we check that the offset really is a
> representation
> >> with the expected checksum?
> >>
> >
> > The full verification would look like this:
> > * recursively enumerate all noderevs in the rep's revision
> > * check that at least one uses the rep
> > * read the rep and verify the checksum
> >
> > This seems quite costly to do during commit - in particular during
> > imports and similar mass commit operations.
>
> On commit we would only want to verify that a revision/offset obtained
> from the cache really was a representation.


That is hard to truly verify. But from a practical POV,
you are right - we can check whether (offset, length)
points to something that

* fully lies within the respective file
* starts with "PLAIN\n", "DELTA\n" or "DELTA $someValidRepReference"
* ends with "ENDREP\n"

Please do *not* check that offsets lie within a specific
range but only within the respective file / pack file.
I'm working on an experimental tool that will reorder
the elements within pack files and there will be no
upper limit for representation offsets of a given revision
other than the pack file length itself.


> I suppose we might want to
> construct the full-text corresponding to the representation to verify
> the checksum.


Yes. Worst thing that might happen is a crash because
we didn't check some buffer size. combined_window_cache
will reduce the I/O overhead for frequently checked reps
such as file properties.


> We do not need to verify the whole revision or the whole
> rep-cache.
>

I only thought that it is hard to verify that a given offset
range actually *is* a representation. To verify that, there
must be some noderev pointing to it. Thus, we had to
enumerate all noderevs until we found a suitable one.

-- Stefan^2.

-- 
*Join us this October for Subversion Live
2012<http://www.wandisco.com/svn-live-2012>– 2 full days of training,
networking, live demos and more! 25%
off before Aug. 10th with discount code “earlybird.”

*Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download

Re: rep-cache sanity check on commit

Posted by Philip Martin <ph...@wandisco.com>.
Stefan Fuhrmann <st...@wandisco.com> writes:

> On Mon, Jul 30, 2012 at 12:40 PM, Philip Martin
> <ph...@wandisco.com>wrote:
>
>> When the commit process finds a representation in the rep-cache the only
>> sanity check that happens is that the revision must be less than or
>> equal to HEAD.  We don't check that the offset is valid:
>>
>>  echo foo > foo
>>  svnadmin create repo
>>  svn import -mm foo file://`pwd`/repo/A/f
>>  sqlite3 repo/db/rep-cache.db "update rep_cache set offset = 4"
>>  svn import -mm foo file://`pwd`/repo/A/g
>>
>> or that the checksum at that offset matches:
>>
>>  echo foo > foo
>>  echo bar > bar
>>  svnadmin create repo
>>  svn import -mm foo file://`pwd`/repo/A/f
>>  sqlite3 repo/db/rep-cache.db "update rep_cache set
>> hash='e242ed3bffccdf271b7fbaf34ed72d089537b42f'"
>>  svn import -mm bar file://`pwd`/repo/A/g
>>
>> In both cases corruption in the rep-cache leads to corruption in the
>> revision files but that corruption is not detected by commit process
>> even though subsequent checkouts will fail.
>>
>
> Has that kind of corruption been observed in the wild?

I'm not aware of any reports.  I did start looking at this because of a
repository with an incorrect offset referring to an older revision file
but that now appears to be a different problem.

>> Should we do more sanity checking?  We are using rep-cache to discard
>> data supplied by the client on the basis that it is already present in
>> the repository.  Should we check that the offset really is a representation
>> with the expected checksum?
>>
>
> The full verification would look like this:
> * recursively enumerate all noderevs in the rep's revision
> * check that at least one uses the rep
> * read the rep and verify the checksum
>
> This seems quite costly to do during commit - in particular during
> imports and similar mass commit operations.

On commit we would only want to verify that a revision/offset obtained
from the cache really was a representation. I suppose we might want to
construct the full-text corresponding to the representation to verify
the checksum. We do not need to verify the whole revision or the whole
rep-cache.

-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download

Re: rep-cache sanity check on commit

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Mon, Jul 30, 2012 at 12:40 PM, Philip Martin
<ph...@wandisco.com>wrote:

> When the commit process finds a representation in the rep-cache the only
> sanity check that happens is that the revision must be less than or
> equal to HEAD.  We don't check that the offset is valid:
>
>  echo foo > foo
>  svnadmin create repo
>  svn import -mm foo file://`pwd`/repo/A/f
>  sqlite3 repo/db/rep-cache.db "update rep_cache set offset = 4"
>  svn import -mm foo file://`pwd`/repo/A/g
>
> or that the checksum at that offset matches:
>
>  echo foo > foo
>  echo bar > bar
>  svnadmin create repo
>  svn import -mm foo file://`pwd`/repo/A/f
>  sqlite3 repo/db/rep-cache.db "update rep_cache set
> hash='e242ed3bffccdf271b7fbaf34ed72d089537b42f'"
>  svn import -mm bar file://`pwd`/repo/A/g
>
> In both cases corruption in the rep-cache leads to corruption in the
> revision files but that corruption is not detected by commit process
> even though subsequent checkouts will fail.
>

Has that kind of corruption been observed in the wild?


> Should we do more sanity checking?  We are using rep-cache to discard
> data supplied by the client on the basis that it is already present in
> the repository.  Should we check that the offset really is a representation
> with the expected checksum?
>

The full verification would look like this:
* recursively enumerate all noderevs in the rep's revision
* check that at least one uses the rep
* read the rep and verify the checksum

This seems quite costly to do during commit - in particular during
imports and similar mass commit operations.

A side issue: "svnadmin verify" doesn't detect the corruption in the
> second case even though checkout will fail.
>

We should fix that, i.e. detect this kind of corruption.
We could add a revision index to rep-cache.db, enumerate
all reps in the revision file and compare them to the db content.

-- Stefan^2.

-- 
*Join us this October for Subversion Live
2012<http://www.wandisco.com/svn-live-2012>– 2 full days of training,
networking, live demos and more! 25%
off before Aug. 10th with discount code “earlybird.”

*Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download

Re: rep-cache sanity check on commit

Posted by Philip Martin <ph...@wandisco.com>.
Philip Martin <ph...@wandisco.com> writes:

> When the commit process finds a representation in the rep-cache the only
> sanity check that happens is that the revision must be less than or
> equal to HEAD.

This check is a bit worrying.  The check is supposed to handle the case
where the rep-cache is newer than the repository files for some reason
(hotcopy, rsync, backup/restore, something like that) but it doesn't
really do the right thing.  The check does correctly identify a bogus
cache entry as being newer than HEAD but the response is wrong.  At
present it simply ignores the bogus entry for the commit in progress but
that leaves the bogus entry to be used by later commits when the HEAD
check no longer identifies it as bogus.

When we detect a cache entry newer than HEAD we should remove the entry
from the cache at the very least.  It would probably be better to remove
all entries newer than HEAD but that would be expensive unless we
introduce an index on revision.  We have bumped the FSFS format in 1.8
so I guess we should have upgrade add an index.

-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download