Posted to users@subversion.apache.org by Roland Besserer <ro...@motorola.com> on 2004/12/29 23:52:28 UTC

SVN Book Method for Splitting Repos doesn't work

Following the example on page 88, I am trying to split a repo by creating
separate repos for individual projects in the existing repo. The first steps:

        (1) dump the existing repo
        (2) svndumpfilter the project you want

work as expected, and I can then populate a newly created repo from the
processed dump file of step (2) above.

As the book mentions, one will typically have to modify the node entries
to 're-root' them in the new repository. In my case, I'm using sed to
convert the entries of the original dump:

        Node-path: documentation/trunk/dir/file

to

        Node-path: trunk/dir/file

and also remove the dump data that would create the "documentation"
directory. The resulting modified dump file appears OK and loads
properly (it handles revision 1, for example) until it hits the
first binary file, at which point the svnadmin load command aborts with
a checksum error on that binary file.

Looking at the dump file I was surprised to see that it is not 
"human readable" as the documentation claims. The binary file (in this
case a PDF) is not uuencoded (or some similar method) but included as
8-bit 'raw' data. That, of course, makes it impossible/difficult to
inspect/edit a dump file using an editor.

Still leaves me at a loss why a simple sed script like:

        sed 's|^Node-path: documentation/|Node-path: |' < dump1 > dump2

which removes the leading 'documentation/' part from all node paths
would create this error on running 'svnadmin load newrepo < dump2':

started new transaction, based on original revision 1
     * adding path : documentation ... done.
     * adding path : branches ... done.
     * adding path : tags ... done.
     * adding path : trunk ... done.
     * adding path : trunk/business ... done.
     * adding path : trunk/design ... done.
     * adding path : trunk/meetings ... done.
     * adding path : trunk/presentations ... done.
     * adding path : trunk/reference ... done.
     * adding path : trunk/reference/DAVIC ... done.
     * adding path : trunk/design/DesignBook/DesignBook.pdf ...svn: Checksum mismatch, rep 'a':
   expected:  225d1ed316bf0830dbdd6c50ff1e79e7
     actual:  f41751bce1f5fe359351df3a9b37be30


Has anyone seen this problem before?

Regards

roland


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@subversion.tigris.org
For additional commands, e-mail: users-help@subversion.tigris.org

Re: SVN Book Method for Splitting Repos doesn't work

Posted by Max Bowsher <ma...@ukf.net>.
Roland Besserer wrote:
...
> Looking at the dump file I was surprised to see that it is not
> "human readable" as the documentation claims. The binary file (in this
> case a PDF) is not uuencoded (or some similar method) but included as
> 8-bit 'raw' data.

Well, I guess our interpretations of "human-readable" differ.
Personally, I think that uuencoding (or similar) doesn't increase 
human-readability, it just wastes processing time.

> That, of course, makes it impossible/difficult to
> inspect/edit a dump file using an editor.

Get a better editor, then.

> Still leaves me at a loss why a simple sed script like:
>
>        sed 's|^Node-path: documentation/|Node-path: |' < dump1 > dump2
>
> which removes the leading 'documentation/' part from all node paths
> would create this error on running 'svnadmin load newrepo < dump2':
>
> started new transaction, based on original revision 1
>     * adding path : documentation ... done.
>     * adding path : branches ... done.
>     * adding path : tags ... done.
>     * adding path : trunk ... done.
>     * adding path : trunk/business ... done.
>     * adding path : trunk/design ... done.
>     * adding path : trunk/meetings ... done.
>     * adding path : trunk/presentations ... done.
>     * adding path : trunk/reference ... done.
>     * adding path : trunk/reference/DAVIC ... done.
>     * adding path : trunk/design/DesignBook/DesignBook.pdf ...svn: Checksum mismatch, rep 'a':
>   expected:  225d1ed316bf0830dbdd6c50ff1e79e7
>     actual:  f41751bce1f5fe359351df3a9b37be30
>
>
> Has anyone seen this problem before?

Looks fairly obvious that the implementation of sed you are using is munging 
the data in some way.

Max.



Re: SVN Book Method for Splitting Repos doesn't work

Posted by Patrick Smears <pa...@ensoft.co.uk>.
On 29 Dec 2004, Roland Besserer wrote:

>[...]
> Still leaves me at a loss why a simple sed script like:
> 
>         sed 's|^Node-path: documentation/|Node-path: |' < dump1 > dump2
> 
> which removes the leading 'documentation/' part from all node paths
> would create this error on running 'svnadmin load newrepo < dump2':
> 
> started new transaction, based on original revision 1
>      * adding path : documentation ... done.
>      * adding path : branches ... done.
>      * adding path : tags ... done.
>      * adding path : trunk ... done.
>      * adding path : trunk/business ... done.
>      * adding path : trunk/design ... done.
>      * adding path : trunk/meetings ... done.
>      * adding path : trunk/presentations ... done.
>      * adding path : trunk/reference ... done.
>      * adding path : trunk/reference/DAVIC ... done.
>      * adding path : trunk/design/DesignBook/DesignBook.pdf ...svn: Checksum mismatch, rep 'a':
>    expected:  225d1ed316bf0830dbdd6c50ff1e79e7
>      actual:  f41751bce1f5fe359351df3a9b37be30
> 
> Has anyone seen this problem before?

I haven't, but the first thing that occurs to me is: what platform are you
running, and which version of 'sed'? Some versions are not very good with
long lines and/or lines containing control characters, so it may be that
'sed' is messing with the file contents. If you're not using it already,
you may have better luck with GNU 'sed'; failing that, converting your
'sed' script to 'perl' may well do the trick (the 's2p' utility will convert
'sed' to 'perl' if you're not a perl wizard...)
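A byte-oriented rewrite avoids line-ending translation and long-line limits in the tool altogether. Here is a minimal Python sketch of that idea (the file names and the demo input are placeholders, not from the thread):

```python
OLD = b"Node-path: documentation/"
NEW = b"Node-path: "

# Stand-in input for the real dump file ("dump1"/"dump2" are placeholder
# names): one header line to rewrite, one line of raw 8-bit data.
with open("dump1", "wb") as f:
    f.write(b"Node-path: documentation/trunk/dir/file\n"
            b"raw 8-bit data \xff\xfe\n")

# Both files are opened in binary mode: nothing is re-encoded or
# EOL-translated, so arbitrary bytes pass through untouched. Only lines
# starting with the old header prefix are changed.
with open("dump1", "rb") as src, open("dump2", "wb") as dst:
    for line in src:
        if line.startswith(OLD):
            line = NEW + line[len(OLD):]
        dst.write(line)
```

Like the sed one-liner, this still assumes the header prefix never happens to begin a line inside file content; if that can occur, a parser that honors the dump format's content lengths is the safer route.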

Patrick
-- 
The easy way to type accents in Windows: http://www.frkeys.com/




Re: SVN Book Method for Splitting Repos doesn't work

Posted by John Szakmeister <jo...@szakmeister.net>.
Roland Besserer wrote:
> Sorry about mangling the name - honest typo :-)
> 
> Almost everything boils down to tradeoffs - we all have to make them in
> our designs or implementations. I can certainly see the reason for
> efficiency in the implementation of svn and the way data is stored and
> managed. I do not think the efficiency argument is nearly as strong
> for a dump utility and dump file format which is, by implication, used
> sporadically. Not being part of the frequent operations performed as
> part of running svn on a daily basis, efficiency concerns should not
> supersede what I would call 'ease of use' - the ability to easily
> process the dump file with the widest variety of tools. I simply fail to
> understand why one would want to place any arbitrary restrictions
> or limitations on the dump file format - particularly one that is
> announced as human readable. In my opinion the dump file
> format should be 7-bit ASCII (back to uuencode and the like :-)

I just wanted to point out that the assumption that 'svnadmin dump' is
run 'sporadically' isn't necessarily true.  There are a number of folks
performing incremental dumps with every commit as part of having a
complete backup/recovery strategy.  'svnadmin dump' is run hundreds of
times per day in our setup.

It may have been intended originally for repository migration, but has
found other uses since then. :-)

> Anyway, not much point in rambling on about this. I can always write
> my own dump routine :-) ... but I think I can live with what we've got.

-John


Re: SVN Book Method for Splitting Repos doesn't work

Posted by Roland Besserer <ro...@motorola.com>.
Sorry about mangling the name - honest typo :-)

Almost everything boils down to tradeoffs - we all have to make them in
our designs or implementations. I can certainly see the reason for
efficiency in the implementation of svn and the way data is stored and
managed. I do not think the efficiency argument is nearly as strong
for a dump utility and dump file format which is, by implication, used
sporadically. Not being part of the frequent operations performed as
part of running svn on a daily basis, efficiency concerns should not
supersede what I would call 'ease of use' - the ability to easily
process the dump file with the widest variety of tools. I simply fail to
understand why one would want to place any arbitrary restrictions
or limitations on the dump file format - particularly one that is
announced as human readable. In my opinion the dump file
format should be 7-bit ASCII (back to uuencode and the like :-)

Anyway, not much point in rambling on about this. I can always write
my own dump routine :-) ... but I think I can live with what we've got.

roland




"Max Bowsher"<ma...@ukf.net> writes:

> Roland Besserer wrote:
> ...
> > I would like to comment on the concept of 'human readable' though.
> > Although emacs (for example) can easily handle binary files just
> > dumped into the output file, including 8-bit data sure doesn't make
> > the dump file human readable anymore. It also makes processing the
> > dump file with text (or more accurately line) oriented tools error
> > prone.
> >
> > SVN is already, in my opinion, somewhat handicapped by the fact that
> > it uses a database backend
> 
> You seem to have ignored FSFS.
> 
> > and thus a binary file format that puts you
> > at the mercy of the decode/repair tools specifically designed for
> > it.
> 
> Under the hood, the formats really aren't that much more difficult to
> comprehend than CVS's. Anyone who really wants to peek under the
> covers is free to do so.
> 
> > It would be nice if at least the dump file format would stick to
> > an ASCII only representation that makes processing of dump files with
> > 'standard' utilities easy and less error prone.
> >
> > Max Bowsler made the interesting comment that "Personally, I think
> > that uuencoding (or similar) doesn't increase human-readability, it
> > just wastes processing time" which I completely disagree with. Who
> > cares about minute incremental decoding time or even file size in
> > this age of multi-GHz processors and 100GB disks. Human readable
> > is a term that should not be taken literally. To me it means that it
> > is an ASCII/text based representation I can feed any tool like sed or
> > awk with.
> 
> Hi, it's me again :-). "Bowsher" not "Bowsler", by the way.
> 
> This is a debate about tradeoffs -
> 
> Subversion saves processing time and file size, at the cost of putting
> greater requirements on the tools used.
> 
> I happen to feel that this is the right tradeoff to make in this case.
> 
> Any small overhead can become quite magnified when dealing with
> gigabytes of data, and if you want to restrict the available byte
> values to printable ASCII, then the amount of space required to store
> arbitrary data will increase by approximately a factor of 3.
> 
> The downside, of course, is the increased restrictions on the tools:
> 
> I think expecting data processing tools to be 8-bit clean is a
> reasonable demand for newly engineered systems today.
> 
> There is the further complication, of course, of dumpfile-header-like
> data appearing in the middle of file content - I admit that this is a
> harder problem. However, both perl and python are excellent tools, and
> the dumpfile format has been deliberately designed to be easily
> parseable, offering a way to cleanly circumvent this issue.
> 
> 
> One particular choice of tradeoffs will never be the optimum for all cases.
> The particular choice made by Subversion happens to work very nicely
> for the common cases of using dumpstreams for backup and migration.
> 
> 
> Max.
> 


Re: SVN Book Method for Splitting Repos doesn't work

Posted by Max Bowsher <ma...@ukf.net>.
Patrick Smears wrote:
> On Thu, 30 Dec 2004, Max Bowsher wrote:
>
>> [...]
>> Any small overhead can become quite magnified when dealing with gigabytes of
>> data, and if you want to restrict the available byte values to printable
>> ASCII, then the amount of space required to store arbitrary data will
>> increase by approximately a factor of 3.
>
> In general I agree 100% with what you're saying - but I'm puzzled as to
> where the factor of 3 comes in? I'd have thought that, with some sort of
> base64 encoding, you'd be able to store 6 bits of "real" data for every 8
> bits of "encoded" data - give or take some overhead for padding, sensible
> line breaks etc - so I'd have thought a figure of 30-40% extra would seem
> more likely... playing with uuencode would seem to confirm this:
>
> % head -c1000000 /dev/urandom | uuencode - | wc -c
> 1377800
>
> Indeed, I'd have thought that storing each "real" byte as two hex digits
> would only double the output size... What am I missing here?

I think I was hallucinating my mathematics :-)

Max.



Re: SVN Book Method for Splitting Repos doesn't work

Posted by Patrick Smears <pa...@ensoft.co.uk>.
On Thu, 30 Dec 2004, Max Bowsher wrote:

> [...]
> Any small overhead can become quite magnified when dealing with gigabytes of 
> data, and if you want to restrict the available byte values to printable 
> ASCII, then the amount of space required to store arbitrary data will 
> increase by approximately a factor of 3.

In general I agree 100% with what you're saying - but I'm puzzled as to
where the factor of 3 comes in? I'd have thought that, with some sort of
base64 encoding, you'd be able to store 6 bits of "real" data for every 8
bits of "encoded" data - give or take some overhead for padding, sensible
line breaks etc - so I'd have thought a figure of 30-40% extra would seem
more likely... playing with uuencode would seem to confirm this:

% head -c1000000 /dev/urandom | uuencode - | wc -c
1377800

Indeed, I'd have thought that storing each "real" byte as two hex digits
would only double the output size... What am I missing here?
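The back-of-the-envelope figure is easy to reproduce. A minimal Python sketch, using base64 rather than uuencode (the encodings have essentially the same 4/3 expansion):

```python
import base64
import os

raw = os.urandom(1_000_000)        # 1 MB of arbitrary binary data
encoded = base64.b64encode(raw)    # printable-ASCII representation

# base64 emits 4 output bytes per 3 input bytes: roughly 33% overhead.
# Hex (two digits per byte) would only double the size, as noted above.
print(len(encoded) / len(raw))
```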

Patrick
-- 
The easy way to type accents in Windows: http://www.frkeys.com/



Re: SVN Book Method for Splitting Repos doesn't work

Posted by Max Bowsher <ma...@ukf.net>.
Roland Besserer wrote:
...
> I would like to comment on the concept of 'human readable' though.
> Although emacs (for example) can easily handle binary files just
> dumped into the output file, including 8-bit data sure doesn't make
> the dump file human readable anymore. It also makes processing the
> dump file with text (or more accurately line) oriented tools error
> prone.
>
> SVN is already, in my opinion, somewhat handicapped by the fact that
> it uses a database backend

You seem to have ignored FSFS.

> and thus a binary file format that puts you
> at the mercy of the decode/repair tools specifically designed for
> it.

Under the hood, the formats really aren't that much more difficult to
comprehend than CVS's. Anyone who really wants to peek under the covers is
free to do so.

> It would be nice if at least the dump file format would stick to
> an ASCII only representation that makes processing of dump files with
> 'standard' utilities easy and less error prone.
>
> Max Bowsler made the interesting comment that "Personally, I think
> that uuencoding (or similar) doesn't increase human-readability, it
> just wastes processing time" which I completely disagree with. Who
> cares about minute incremental decoding time or even file size in
> this age of multi-GHz processors and 100GB disks. Human readable
> is a term that should not be taken literally. To me it means that it
> is an ASCII/text based representation I can feed any tool like sed or
> awk with.

Hi, it's me again :-). "Bowsher" not "Bowsler", by the way.

This is a debate about tradeoffs -

Subversion saves processing time and file size, at the cost of putting 
greater requirements on the tools used.

I happen to feel that this is the right tradeoff to make in this case.

Any small overhead can become quite magnified when dealing with gigabytes of 
data, and if you want to restrict the available byte values to printable 
ASCII, then the amount of space required to store arbitrary data will 
increase by approximately a factor of 3.

The downside, of course, is the increased restrictions on the tools:

I think expecting data processing tools to be 8-bit clean is a reasonable 
demand for newly engineered systems today.

There is the further complication, of course, of dumpfile-header-like data 
appearing in the middle of file content - I admit that this is a harder 
problem. However, both perl and python are excellent tools, and the dumpfile 
format has been deliberately designed to be easily parseable, offering a way 
to cleanly circumvent this issue.
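That length-aware parsing can be sketched in a few lines of Python. This is an illustrative sketch, not a full parser: it assumes the dump format's `Content-length:` header (present in the v2 format) is the last header before a blank line and the body, and the demo input is synthetic:

```python
import io

def rewrite_node_paths(src, dst,
                       old=b"Node-path: documentation/",
                       new=b"Node-path: "):
    """Stream a dump, rewriting Node-path headers; content blocks are
    copied verbatim by honoring Content-length, so header-like bytes
    inside file content are never touched."""
    while True:
        line = src.readline()
        if not line:
            break
        if line.startswith(old):
            line = new + line[len(old):]
        dst.write(line)
        if line.startswith(b"Content-length: "):
            length = int(line.split(b":")[1])
            dst.write(src.readline())     # blank line before the body
            dst.write(src.read(length))   # body bytes, copied untouched

# Tiny demo: the 30-byte body happens to look like a header itself,
# but is skipped over by length and copied through unchanged.
src = io.BytesIO(b"Node-path: documentation/trunk/f\n"
                 b"Content-length: 30\n\n"
                 b"Node-path: documentation/fake\n"
                 b"Node-path: documentation/g\n")
dst = io.BytesIO()
rewrite_node_paths(src, dst)
print(dst.getvalue())
```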


One particular choice of tradeoffs will never be the optimum for all cases.
The particular choice made by Subversion happens to work very nicely for the 
common cases of using dumpstreams for backup and migration.


Max.



Re: SVN Book Method for Splitting Repos doesn't work

Posted by Roland Besserer <ro...@motorola.com>.
Ok, the problem did in fact turn out to be that I was running /bin/sed
instead of /usr/local/bin/sed.

I would like to comment on the concept of 'human readable' though.
Although emacs (for example) can easily handle binary files just
dumped into the output file, including 8-bit data sure doesn't make
the dump file human readable anymore. It also makes processing the
dump file with text (or more accurately line) oriented tools error
prone.

SVN is already, in my opinion, somewhat handicapped by the fact that
it uses a database backend and thus a binary file format that puts you
at the mercy of the decode/repair tools specifically designed for
it. It would be nice if at least the dump file format would stick to
an ASCII only representation that makes processing of dump files with
'standard' utilities easy and less error prone.

Max Bowsler made the interesting comment that "Personally, I think
that uuencoding (or similar) doesn't increase human-readability, it
just wastes processing time" which I completely disagree with. Who
cares about minute incremental decoding time or even file size in
this age of multi-GHz processors and 100GB disks. Human readable
is a term that should not be taken literally. To me it means that it
is an ASCII/text based representation I can feed any tool like sed or
awk with.

I want to be able to modify, cut, splice 'n' dice the dump file at will,
using line-oriented utilities, and not get surprised by the odd chunk of
binary data in it.


roland




"Dale Worley"<dw...@pingtel.com> writes:

> I expect that your problem is coming from sed's effect on binary input.  As
> others have mentioned, it's likely that it is not reacting well to what it
> perceives as excessively long lines, or line-ending characters that are not
> in the correct configuration for your platform.  (This will be particularly
> grim for programs (like sed) that are written in C when run on platforms
> that use CR-LF for end-of-line, as the C I/O library is required to
> translate EOLs into LF upon input and back into CR-LF on output.)
> 
> One possibility would be to use Gnu Emacs, which is quite robust against
> these problems (provided you use find-file-literally to make sure that it
> doesn't try to be intelligent about the character encoding of your file).
> I've successfully used it to patch the character strings in executable
> files, so it should work for you.
> 
> Another thing which might help is to not rewrite the file names.  Instead,
> dump and svndumpfilter the data to extract the files you want, then load that
> into the new repository.  Then use svn commands to adjust the file names.
> For instance, you want to change "documentation/trunk/dir/file" into
> "trunk/dir/file", so you could do "svn mv documentation/trunk trunk".  That
> sequence of operations would use only software that was already proven on
> the tasks in question.
> 
> Dale
> 


RE: SVN Book Method for Splitting Repos doesn't work

Posted by Dale Worley <dw...@pingtel.com>.
I expect that your problem is coming from sed's effect on binary input.  As
others have mentioned, it's likely that it is not reacting well to what it
perceives as excessively long lines, or line-ending characters that are not
in the correct configuration for your platform.  (This will be particularly
grim for programs (like sed) that are written in C when run on platforms
that use CR-LF for end-of-line, as the C I/O library is required to
translate EOLs into LF upon input and back into CR-LF on output.)

One possibility would be to use Gnu Emacs, which is quite robust against
these problems (provided you use find-file-literally to make sure that it
doesn't try to be intelligent about the character encoding of your file).
I've successfully used it to patch the character strings in executable
files, so it should work for you.

Another thing which might help is to not rewrite the file names.  Instead,
dump and svndumpfilter the data to extract the files you want, then load that
into the new repository.  Then use svn commands to adjust the file names.
For instance, you want to change "documentation/trunk/dir/file" into
"trunk/dir/file", so you could do "svn mv documentation/trunk trunk".  That
sequence of operations would use only software that was already proven on
the tasks in question.

Dale

-----Original Message-----
From: roland@kanaha.am.mot.com [mailto:roland@kanaha.am.mot.com]On
Behalf Of Roland Besserer
Sent: Wednesday, December 29, 2004 6:52 PM
To: users@subversion.tigris.org
Subject: SVN Book Method for Splitting Repos doesn't work



Following the example on page 88, I am trying to split a repo by creating
separate repos for individual projects in the existing repo. The first
steps:

        (1) dump the existing repo
        (2) svndumpfilter the project you want

work as expected, and I can then populate a newly created repo from the
processed dump file of step (2) above.

As the book mentions, one will typically have to modify the node entries
to 're-root' them in the new repository. In my case, I'm using sed to
convert the entries of the original dump:

        Node-path: documentation/trunk/dir/file

to

        Node-path: trunk/dir/file

and also remove the dump data that would create the "documentation"
directory. The resulting modified dump file appears OK and loads
properly (it handles revision 1, for example) until it hits the
first binary file, at which point the svnadmin load command aborts with
a checksum error on that binary file.

Looking at the dump file I was surprised to see that it is not
"human readable" as the documentation claims. The binary file (in this
case a PDF) is not uuencoded (or some similar method) but included as
8-bit 'raw' data. That, of course, makes it impossible/difficult to
inspect/edit a dump file using an editor.

Still leaves me at a loss why a simple sed script like:

        sed 's|^Node-path: documentation/|Node-path: |' < dump1 > dump2

which removes the leading 'documentation/' part from all node paths
would create this error on running 'svnadmin load newrepo < dump2':

started new transaction, based on original revision 1
     * adding path : documentation ... done.
     * adding path : branches ... done.
     * adding path : tags ... done.
     * adding path : trunk ... done.
     * adding path : trunk/business ... done.
     * adding path : trunk/design ... done.
     * adding path : trunk/meetings ... done.
     * adding path : trunk/presentations ... done.
     * adding path : trunk/reference ... done.
     * adding path : trunk/reference/DAVIC ... done.
     * adding path : trunk/design/DesignBook/DesignBook.pdf ...svn: Checksum mismatch, rep 'a':
   expected:  225d1ed316bf0830dbdd6c50ff1e79e7
     actual:  f41751bce1f5fe359351df3a9b37be30


Has anyone seen this problem before?

Regards

roland




