Posted to users@subversion.apache.org by olli hauer <oh...@gmx.de> on 2012/11/25 20:18:57 UTC

Is there a way to dump the checksums from a svn repo?

Is there a way to dump the checksums from a svn repo?

What I'm doing at the moment on masters and slaves is
$> svnadmin verify
and
$> sqlite $repo/db/rep-cache.db "select hash,revision from rep_cache"

then additionally comparing the sqlite output from the master and the slaves.
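
A minimal Python sketch of that comparison, assuming both rep-cache.db files
are readable from one machine and that the rep_cache table has the hash and
revision columns used in the query above. The function names are made up for
illustration, and the result only reflects cached metadata, not the file
contents themselves.

```python
import sqlite3
from pathlib import Path


def load_rep_cache(db_path):
    """Return the set of (hash, revision) rows stored in a rep-cache.db."""
    # Open read-only so that inspecting the cache can never modify it.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        return set(conn.execute("SELECT hash, revision FROM rep_cache"))
    finally:
        conn.close()


def diff_rep_caches(master_db, slave_db):
    """Return rows found only on the master and rows found only on the slave."""
    master = load_rep_cache(master_db)
    slave = load_rep_cache(slave_db)
    return sorted(master - slave), sorted(slave - master)
```

Two empty lists from diff_rep_caches() would mean the cached hash/revision
pairs agree; anything else names the rows to look at more closely.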

Since rep-cache is not used during read requests it would be nice to have
for example a parameter for svnadmin verify to output the checksums so
they can be compared between master and slaves.

Is there a way, for example via the python/perl API?

Thanks for every answer and code snippet ...

--
Regards,
olli

Re: Is there a way to dump the checksums from a svn repo?

Posted by olli hauer <oh...@gmx.de>.
On 2012-11-29 19:24, Philip Martin wrote:
> olli hauer <oh...@gmx.de> writes:
> 
>> Is there a way to dump the checksums from a svn repo?
>>
>> What I'm doing at the moment on masters and slaves is
>> $> svnadmin verify
>> and
>> $> sqlite $repo/db/rep-cache.db "select hash,revision from rep_cache"
>>
>> then additionally comparing the sqlite output from the master and the slaves.
>>
>> Since rep-cache is not used during read requests it would be nice to have
>> for example a parameter for svnadmin verify to output the checksums so
>> they can be compared between master and slaves.
>>
>> Is there a way, for example via the python/perl API?
>>
>> Thanks for every answer and code snippet ...
> 
> I did it in C but I suppose you might be able to use the Python
> bindings.  I did
> 
>     svn_fs_open()
>     svn_fs_revision_root(N)
>     svn_repos_replay2(N-1)
> 
> which drove an editor from rN-1 to rN and the editor did nothing except
> extract the checksum from the close_file callback.
> 

Thanks for the hint, I will do some tests with your promised snippet.


Re: Is there a way to dump the checksums from a svn repo?

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Philip Martin wrote on Thu, Nov 29, 2012 at 18:24:38 +0000:
> olli hauer <oh...@gmx.de> writes:
> 
> > Is there a way to dump the checksums from a svn repo?
> >
> > What I'm doing at the moment on masters and slaves is
> > $> svnadmin verify
> > and
> > $> sqlite $repo/db/rep-cache.db "select hash,revision from rep_cache"
> >
> > then additionally comparing the sqlite output from the master and the slaves.
> >
> > Since rep-cache is not used during read requests it would be nice to have
> > for example a parameter for svnadmin verify to output the checksums so
> > they can be compared between master and slaves.
> >
> > Is there a way, for example via the python/perl API?
> >
> > Thanks for every answer and code snippet ...
> 
> I did it in C but I suppose you might be able to use the Python
> bindings.  I did
> 
>     svn_fs_open()
>     svn_fs_revision_root(N)
>     svn_repos_replay2(N-1)
> 
> which drove an editor from rN-1 to rN and the editor did nothing except
> extract the checksum from the close_file callback.

This will only give you the precalculated checksum stored as a metadata
attribute within the backend --- it's not going to checksum the file
on-the-fly to compute the actual checksum.
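
To checksum the actual content instead of reading stored metadata, one option
is to fetch each changed file with `svnlook cat` and hash the bytes. A rough
Python sketch follows. The helper names are hypothetical, SHA-1 is an arbitrary
choice here, the output of `svnlook changed` is assumed to be UTF-8 with the
path starting at the fifth column, and lines whose first status column is
anything other than A or U are ignored.

```python
import hashlib
import subprocess


def parse_changed(output):
    """Extract paths of added or modified files from `svnlook changed` text."""
    paths = []
    for line in output.splitlines():
        if not line.strip():
            continue
        status, path = line[:3], line[4:]
        # Skip deletions, property-only changes ("_U") and directories.
        if status[:1] in ("A", "U") and not path.endswith("/"):
            paths.append(path)
    return paths


def svnlook(repo, subcommand, rev, *args):
    """Run an svnlook subcommand against one revision and return raw stdout."""
    cmd = ["svnlook", subcommand, "-r", str(rev), repo, *args]
    return subprocess.run(cmd, check=True, capture_output=True).stdout


def checksum_revision(repo, rev, run=svnlook):
    """Map each file changed in `rev` to the SHA-1 of its actual content."""
    changed = run(repo, "changed", rev).decode("utf-8")
    return {
        path: hashlib.sha1(run(repo, "cat", rev, path)).hexdigest()
        for path in parse_changed(changed)
    }
```

Running checksum_revision() for the same revision on the master and on a
mirror and comparing the two dictionaries would expose content that differs
even when the stored checksum has been adjusted to match. The `run` parameter
exists only so the logic can be exercised without a repository.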

Re: Is there a way to dump the checksums from a svn repo?

Posted by Philip Martin <ph...@wandisco.com>.
olli hauer <oh...@gmx.de> writes:

> Is there a way to dump the checksums from a svn repo?
>
> What I'm doing at the moment on masters and slaves is
> $> svnadmin verify
> and
> $> sqlite $repo/db/rep-cache.db "select hash,revision from rep_cache"
>
> then additionally comparing the sqlite output from the master and the slaves.
>
> Since rep-cache is not used during read requests it would be nice to have
> for example a parameter for svnadmin verify to output the checksums so
> they can be compared between master and slaves.
>
> Is there a way, for example via the python/perl API?
>
> Thanks for every answer and code snippet ...

I did it in C but I suppose you might be able to use the Python
bindings.  I did

    svn_fs_open()
    svn_fs_revision_root(N)
    svn_repos_replay2(N-1)

which drove an editor from rN-1 to rN and the editor did nothing except
extract the checksum from the close_file callback.

-- 
Certified & Supported Apache Subversion Downloads:
http://www.wandisco.com/subversion/download

Re: Is there a way to dump the checksums from a svn repo?

Posted by Philip Martin <ph...@wandisco.com>.
Daniel Shahaf <d....@daniel.shahaf.name> writes:

>> Further, node-revision-ids can vary for other reasons.  Representations
>> in the revision files are in whatever order the client sends
>> representations to the server.  There are no defined orders for clients
>> to use so it is quite likely that commits to the master and the mirror
>> will use different orders:
>
>> That affects the offsets in the text: lines, often changing the line
>> length, which in turn affects the position of the subsequent nodes, and
>> the position of the nodes affects the node-revision-ids.
>
> Yes, that's exactly what your thread <87...@stat.home.lan> was
> about.  I thought in the end that patch got committed?

That was committed but it's not quite the same problem.  That thread was
about revision file differences caused by the server itself.  When
comparing commits on a master and slave there can also be differences
caused by the client.


Re: Is there a way to dump the checksums from a svn repo?

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Philip Martin wrote on Thu, Nov 29, 2012 at 19:13:11 +0000:
> Daniel Shahaf <d....@daniel.shahaf.name> writes:
> 
> > Philip Martin wrote on Thu, Nov 29, 2012 at 18:26:04 +0000:
> >> Daniel Shahaf <d....@daniel.shahaf.name> writes:
> >> 
> >> > Les Mikesell wrote on Thu, Nov 29, 2012 at 09:59:47 -0600:
> >> >> But, the copy built by svnsync doesn't necessarily
> >> >> get stored the same way, does it?
> >> >
> >> > I think in 1.8/fsfs it will be byte-for-byte identical.  (except
> >> > rep-cache.db, but you can remove that file without consequences)
> >> >
> >> > There was a dev@ thread by philipm about this not too long ago.
> >> 
> >> No, an svnsync mirror is usually not identical to the master.  It does
> >> contain the same versioned data but the representation of that data is
> >> different.  For example, every failed commit on the master will bump the
> >> fsfs sequence number and that will cause the node-revision-ids to be
> >> different.
> >
> > Node-revision-id's in revisions don't embed transaction id's...
> >
> > For example the noderev header (yes, header, not just id) of
> > /subversion/trunk/notes is identical between svn.us and svn.eu.
> 
> OK.  But the sequence number differences do show up in other places:
> 
> Further, node-revision-ids can vary for other reasons.  Representations
> in the revision files are in whatever order the client sends
> representations to the server.  There are no defined orders for clients
> to use so it is quite likely that commits to the master and the mirror
> will use different orders:

> That affects the offsets in the text: lines, often changing the line
> length, which in turn affects the position of the subsequent nodes, and
> the position of the nodes affects the node-revision-ids.
> 

Yes, that's exactly what your thread <87...@stat.home.lan> was
about.  I thought in the end that patch got committed?

> svnadmin create repo
> svn mkdir -mm file://`pwd`/repo/A     # r1
> svn mkdir -mm file://`pwd`/repo/A     # fail
> svn mkdir -mm file://`pwd`/repo/A/B   # r2
> svnadmin create repo2
> svnadmin dump repo | svnadmin load repo2
> diff repo/db/revs/0/2 repo2/db/revs/0/2
> 37c37
> < _1.0.t1-2 add-dir false false /A/B
> ---
> > _1.0.t1-1 add-dir false false /A/B
> 

Well, that answers the question: revision files are not byte-for-byte
identical.

I wonder, though, if we should be rewriting these to use the revfile
noderev id's?  If not to avoid _* id's in revfiles, then to make the
revfiles deterministic by using the ("stable") revfile noderev id's ---
for the reasons given in your linked thread.

Re: Is there a way to dump the checksums from a svn repo?

Posted by olli hauer <oh...@gmx.de>.
On 2012-11-29 20:13, Philip Martin wrote:
> Daniel Shahaf <d....@daniel.shahaf.name> writes:
> 
>> Philip Martin wrote on Thu, Nov 29, 2012 at 18:26:04 +0000:
>>> Daniel Shahaf <d....@daniel.shahaf.name> writes:
>>>
>>>> Les Mikesell wrote on Thu, Nov 29, 2012 at 09:59:47 -0600:
>>>>> But, the copy built by svnsync doesn't necessarily
>>>>> get stored the same way, does it?
>>>>
>>>> I think in 1.8/fsfs it will be byte-for-byte identical.  (except
>>>> rep-cache.db, but you can remove that file without consequences)
>>>>
>>>> There was a dev@ thread by philipm about this not too long ago.
>>>
>>> No, an svnsync mirror is usually not identical to the master.  It does
>>> contain the same versioned data but the representation of that data is
>>> different.  For example, every failed commit on the master will bump the
>>> fsfs sequence number and that will cause the node-revision-ids to be
>>> different.
>>
>> Node-revision-id's in revisions don't embed transaction id's...
>>
>> For example the noderev header (yes, header, not just id) of
>> /subversion/trunk/notes is identical between svn.us and svn.eu.
> 
> OK.  But the sequence number differences do show up in other places:
> 
> svnadmin create repo
> svn mkdir -mm file://`pwd`/repo/A     # r1
> svn mkdir -mm file://`pwd`/repo/A     # fail
> svn mkdir -mm file://`pwd`/repo/A/B   # r2
> svnadmin create repo2
> svnadmin dump repo | svnadmin load repo2
> diff repo/db/revs/0/2 repo2/db/revs/0/2
> 37c37
> < _1.0.t1-2 add-dir false false /A/B
> ---
> > _1.0.t1-1 add-dir false false /A/B
> 
> Further, node-revision-ids can vary for other reasons.  Representations
> in the revision files are in whatever order the client sends
> representations to the server.  There are no defined orders for clients
> to use so it is quite likely that commits to the master and the mirror
> will use different orders:
> 
> mkdir zz
> echo foo > zz/f
> echo bar > zz/g
> echo zigzig > zz/F
> echo zagzag > zz/G
> svnadmin create repo
> svn mkdir -mm file://`pwd`/repo/A
> svnadmin create repo2
> svnsync init file://`pwd`/repo2 file://`pwd`/repo
> svnsync sync file://`pwd`/repo2
> 
> I see orders:
> 
>    repo/db/revs/0/1: foo, zigzig, zagzag, bar
>   repo2/db/revs/0/1: zigzig, zagzag, foo, bar
> 
> That affects the offsets in the text: lines, often changing the line
> length, which in turn affects the position of the subsequent nodes, and
> the position of the nodes affects the node-revision-ids.
> 

That's what I also see with svnsync, especially for revisions with a lot of
files in the initial commit (master and mirror run the same OS and are installed
with exactly the same packages, no matter whether I sync over svn or http(s)).



Re: Is there a way to dump the checksums from a svn repo?

Posted by Philip Martin <ph...@wandisco.com>.
Philip Martin <ph...@wandisco.com> writes:

> mkdir zz
> echo foo > zz/f
> echo bar > zz/g
> echo zigzig > zz/F
> echo zagzag > zz/G
> svnadmin create repo
> svn mkdir -mm file://`pwd`/repo/A

oops! should be import not mkdir

  svn import -mm zz file://`pwd`/repo/A

> svnadmin create repo2
> svnsync init file://`pwd`/repo2 file://`pwd`/repo
> svnsync sync file://`pwd`/repo2


Re: Is there a way to dump the checksums from a svn repo?

Posted by Philip Martin <ph...@wandisco.com>.
Daniel Shahaf <d....@daniel.shahaf.name> writes:

> Philip Martin wrote on Thu, Nov 29, 2012 at 18:26:04 +0000:
>> Daniel Shahaf <d....@daniel.shahaf.name> writes:
>> 
>> > Les Mikesell wrote on Thu, Nov 29, 2012 at 09:59:47 -0600:
>> >> But, the copy built by svnsync doesn't necessarily
>> >> get stored the same way, does it?
>> >
>> > I think in 1.8/fsfs it will be byte-for-byte identical.  (except
>> > rep-cache.db, but you can remove that file without consequences)
>> >
>> > There was a dev@ thread by philipm about this not too long ago.
>> 
>> No, an svnsync mirror is usually not identical to the master.  It does
>> contain the same versioned data but the representation of that data is
>> different.  For example, every failed commit on the master will bump the
>> fsfs sequence number and that will cause the node-revision-ids to be
>> different.
>
> Node-revision-id's in revisions don't embed transaction id's...
>
> For example the noderev header (yes, header, not just id) of
> /subversion/trunk/notes is identical between svn.us and svn.eu.

OK.  But the sequence number differences do show up in other places:

svnadmin create repo
svn mkdir -mm file://`pwd`/repo/A     # r1
svn mkdir -mm file://`pwd`/repo/A     # fail
svn mkdir -mm file://`pwd`/repo/A/B   # r2
svnadmin create repo2
svnadmin dump repo | svnadmin load repo2
diff repo/db/revs/0/2 repo2/db/revs/0/2
37c37
< _1.0.t1-2 add-dir false false /A/B
---
> _1.0.t1-1 add-dir false false /A/B

Further, node-revision-ids can vary for other reasons.  Representations
in the revision files are in whatever order the client sends
representations to the server.  There are no defined orders for clients
to use so it is quite likely that commits to the master and the mirror
will use different orders:

mkdir zz
echo foo > zz/f
echo bar > zz/g
echo zigzig > zz/F
echo zagzag > zz/G
svnadmin create repo
svn mkdir -mm file://`pwd`/repo/A
svnadmin create repo2
svnsync init file://`pwd`/repo2 file://`pwd`/repo
svnsync sync file://`pwd`/repo2

I see orders:

   repo/db/revs/0/1: foo, zigzig, zagzag, bar
  repo2/db/revs/0/1: zigzig, zagzag, foo, bar

That affects the offsets in the text: lines, often changing the line
length, which in turn affects the position of the subsequent nodes, and
the position of the nodes affects the node-revision-ids.


Re: Is there a way to dump the checksums from a svn repo?

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Philip Martin wrote on Thu, Nov 29, 2012 at 18:26:04 +0000:
> Daniel Shahaf <d....@daniel.shahaf.name> writes:
> 
> > Les Mikesell wrote on Thu, Nov 29, 2012 at 09:59:47 -0600:
> >> But, the copy built by svnsync doesn't necessarily
> >> get stored the same way, does it?
> >
> > > I think in 1.8/fsfs it will be byte-for-byte identical.  (except
> > rep-cache.db, but you can remove that file without consequences)
> >
> > There was a dev@ thread by philipm about this not too long ago.
> 
> No, an svnsync mirror is usually not identical to the master.  It does
> contain the same versioned data but the representation of that data is
> different.  For example, every failed commit on the master will bump the
> fsfs sequence number and that will cause the node-revision-ids to be
> different.

Node-revision-id's in revisions don't embed transaction id's...

For example the noderev header (yes, header, not just id) of
/subversion/trunk/notes is identical between svn.us and svn.eu.

Re: Is there a way to dump the checksums from a svn repo?

Posted by Philip Martin <ph...@wandisco.com>.
Daniel Shahaf <d....@daniel.shahaf.name> writes:

> Les Mikesell wrote on Thu, Nov 29, 2012 at 09:59:47 -0600:
>> But, the copy built by svnsync doesn't necessarily
>> get stored the same way, does it?
>
> I think in 1.8/fsfs it will be byte-for-byte identical.  (except
> rep-cache.db, but you can remove that file without consequences)
>
> There was a dev@ thread by philipm about this not too long ago.

No, an svnsync mirror is usually not identical to the master.  It does
contain the same versioned data but the representation of that data is
different.  For example, every failed commit on the master will bump the
fsfs sequence number and that will cause the node-revision-ids to be
different.


Re: Is there a way to dump the checksums from a svn repo?

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Les Mikesell wrote on Thu, Nov 29, 2012 at 09:59:47 -0600:
> On Thu, Nov 29, 2012 at 1:59 AM, Thorsten Schöning
> <ts...@am-soft.de> wrote:
> > Hello olli hauer,
> > on Wednesday, 28 November 2012 at 22:45 you wrote:
> >
> >> Someone hacks one of the additional mirrors, modifies a revision and adjusts the
> >> checksum (as described in many places on how to fix a corrupt repo) so it looks OK
> >> even with svnadmin verify.
> >
> > Sounds interesting, but if the mirrors that are not under your full
> > control have already been hacked, how can you trust the checksums that
> > svnadmin produces locally? You can't, because you can't trust the
> > mirror in any way: svnadmin could be manipulated, too. You would need
> > to get the data into a trusted environment again and check it from there.
> 
> For things where the file representation is the same, I just use an
> 'rsync -nv' against a known-good copy to verify integrity and it runs
> pretty quickly.  But, the copy built by svnsync doesn't necessarily
> get stored the same way, does it?

I think in 1.8/fsfs it will be byte-for-byte identical.  (except
rep-cache.db, but you can remove that file without consequences)

There was a dev@ thread by philipm about this not too long ago.

Re: Is there a way to dump the checksums from a svn repo?

Posted by Les Mikesell <le...@gmail.com>.
On Thu, Nov 29, 2012 at 1:59 AM, Thorsten Schöning
<ts...@am-soft.de> wrote:
> Hello olli hauer,
> on Wednesday, 28 November 2012 at 22:45 you wrote:
>
>> Someone hacks one of the additional mirrors, modifies a revision and adjusts the
>> checksum (as described in many places on how to fix a corrupt repo) so it looks OK
>> even with svnadmin verify.
>
> Sounds interesting, but if the mirrors that are not under your full
> control have already been hacked, how can you trust the checksums that
> svnadmin produces locally? You can't, because you can't trust the
> mirror in any way: svnadmin could be manipulated, too. You would need
> to get the data into a trusted environment again and check it from there.

For things where the file representation is the same, I just use an
'rsync -nv' against a known-good copy to verify integrity and it runs
pretty quickly.  But, the copy built by svnsync doesn't necessarily
get stored the same way, does it?
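
A byte-for-byte check against a known-good copy, in the spirit of the rsync
dry run mentioned above, could be sketched in Python as follows. This is only
meaningful where both copies are expected to have the same on-disk
representation. The function name is made up for illustration, and skipping
rep-cache.db by default is an assumption taken from the remark elsewhere in
this thread that the file can be removed without consequences.

```python
import filecmp
import os


def differing_files(trusted_dir, mirror_dir, skip=("rep-cache.db",)):
    """List relative paths whose bytes differ or that exist on one side only."""
    def listing(root):
        found = set()
        for base, _dirs, names in os.walk(root):
            for name in names:
                if name not in skip:
                    full = os.path.join(base, name)
                    found.add(os.path.relpath(full, root))
        return found

    trusted, mirror = listing(trusted_dir), listing(mirror_dir)
    # Files present on only one side are reported as differences too.
    result = list(trusted ^ mirror)
    for rel in trusted & mirror:
        a = os.path.join(trusted_dir, rel)
        b = os.path.join(mirror_dir, rel)
        # shallow=False forces a comparison of the file contents.
        if not filecmp.cmp(a, b, shallow=False):
            result.append(rel)
    return sorted(result)
```

An empty list means every compared file is identical in both trees.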

-- 
  Les Mikesell
    lesmikesell@gmail.com

Re: Is there a way to dump the checksums from a svn repo?

Posted by Thorsten Schöning <ts...@am-soft.de>.
Hello olli hauer,
on Wednesday, 28 November 2012 at 22:45 you wrote:

> Someone hacks one of the additional mirrors, modifies a revision and adjusts the
> checksum (as described in many places on how to fix a corrupt repo) so it looks OK
> even with svnadmin verify.

Sounds interesting, but if the mirrors that are not under your full
control have already been hacked, how can you trust the checksums that
svnadmin produces locally? You can't, because you can't trust the
mirror in any way: svnadmin could be manipulated, too. You would need
to get the data into a trusted environment again and check it from there.

Your solution wouldn't even scale, as you would have to recalculate all
checksums and compare all revisions all over again. You would never
reach a point in time where you could say that the first million
revisions are totally OK and rely on that in the future.

I would think in another direction and use digital signatures to
detect changes made to revisions after they have been approved as being
in a consistent state with the master. Get the unsigned revisions from
the mirrors, compare them file by file, using hashes, with the revisions
you trust, and if everything is OK, sign them. Depending on your
mirrors and the security you need, you wouldn't even need to copy the
data; just make it accessible for read access over ssh or whatever.

The benefit is that you could use already available tools and would only
need to check unsigned revisions, while you can check the integrity of
the already signed revisions really fast and whenever you like. The
signature information for each revision file or checked block, however
you would implement such an approach, can even be stored on the
untrusted mirrors. That is no problem, as nobody other than you and
whoever you trust is able to create valid signatures.
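
The sign-once, verify-often idea can be sketched in Python as shown below.
Note the substitution: this uses an HMAC-SHA256 tag from the standard library
as a stand-in for a real digital signature, so the secret key must stay in the
trusted environment and verification has to happen there as well. The function
names and the choice to bind the file name into the tag are assumptions made
for illustration only.

```python
import hashlib
import hmac
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 digest (bytes) of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()


def sign_revision(rev_file, key):
    """Produce a hex tag for a revision file that was already verified."""
    # Bind the file name into the tag so tags cannot be swapped between files.
    name = os.path.basename(rev_file).encode("utf-8")
    message = name + b"\0" + file_digest(rev_file)
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_revision(rev_file, tag, key):
    """Check that a revision file still matches the tag issued at approval."""
    return hmac.compare_digest(sign_revision(rev_file, key), tag)
```

With such tags stored next to the revision files, only revisions without a
tag need the expensive comparison against the master; tagged revisions can be
rechecked with verify_revision() at any time.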

Just an idea, as signatures were made exactly for such purposes, where
one has to detect any kind of data manipulation. Besides that, maybe
have a look at the mirroring products of WanDisco; it's possible that
they already have a solution.

Kind regards,

Thorsten Schöning

-- 
Thorsten Schöning       E-Mail:Thorsten.Schoening@AM-SoFT.de
AM-SoFT IT-Systeme      http://www.AM-SoFT.de/

Phone.............05151-  9468- 55
Fax...............05151-  9468- 88
Mobile.............0178-8 9468- 04

AM-SoFT GmbH IT-Systeme, Brandenburger Str. 7c, 31789 Hameln
AG Hannover HRB 207 694 - Managing Director: Andreas Muchow


Re: Is there a way to dump the checksums from a svn repo?

Posted by olli hauer <oh...@gmx.de>.
On 2012-11-25 21:49, Thorsten Schöning wrote:
> Hello olli hauer,
> on Sunday, 25 November 2012 at 20:18 you wrote:
> 
>> Thanks for every answer and code snippet ...
> 
> Which problem are you trying to solve with your approach?
> What's the reason behind it? Maybe there are other ways to accomplish
> what you want.
> 
> Kind regards,
> 
> Thorsten Schöning
> 

Sorry for the delay ...

I will try to explain some of my thoughts.

Suppose you have one svn master from which dedicated slaves are syncing.
Both the master and these first slaves are under your control; so far, so good.

Now some additional mirrors, which are not under your full control, are syncing
from the slaves to help offload traffic.

Someone hacks one of the additional mirrors, modifies a revision and adjusts the
checksum (as described in many places on how to fix a corrupt repo) so it looks OK
even with svnadmin verify.

Now if you have a million revisions it will be hard to detect such an issue.

Wouldn't it be nice to have the ability to calculate the checksums regularly so
they can be compared with the upstream checksums?

Another method to detect such a thing would be to rsync the repo first with a dry
run and then do a live sync, but svnsync is preferred.

--
Regards,
olli

Re: Is there a way to dump the checksums from a svn repo?

Posted by Thorsten Schöning <ts...@am-soft.de>.
Hello olli hauer,
on Sunday, 25 November 2012 at 20:18 you wrote:

> Thanks for every answer and code snippet ...

Which problem are you trying to solve with your approach?
What's the reason behind it? Maybe there are other ways to accomplish
what you want.

Kind regards,

Thorsten Schöning
