Posted to users@subversion.apache.org by Geoff Worboys <ge...@telesiscomputing.com.au> on 2010/06/22 14:36:40 UTC

Generating a dump file using a powershell script

Hi All,

I've just joined this group.  I've been using subversion for a
few years now - most of my day to day stuff via TortoiseSvn.
A few days ago I once again came across a requirement where I
said "subversion is what I need here" only to once again hit
the issue that to start a new project in subversion means
losing all the file time-stamps.  I don't want to re-start
arguments on that front (I see from googling and archives that
it is a VERY old discussion).  I simply have some questions in
regard to my own chosen work-around to the problem.

It seemed to me that for most of my requirements I did not need
extra features in subversion, all I really needed was some way
to create the new repository so that it looked like all the
files I imported were committed at the time it says on the file
from the original source.  If I could get that then I could use
the "use-commit-times" option to keep things very close to the
way I wanted them.  [And I could keep using TortoiseSvn and I
would be a happy man.]

That all led me to trying to create my own dump files.  I ended
up choosing powershell scripting because I wanted to learn about
it and this seemed like an interesting project to try with it.
I have a working script now; put simply, it is executed as:

powershell .\Import-from-Source D:\SourceFolder D:\Temp\DumpFile.dat

It takes the entire contents of D:\SourceFolder and creates
a subversion dump file in D:\Temp\DumpFile.dat.  It replicates
the structure inside D:\SourceFolder so if you want a "trunk"
folder etc you have to have created them first.

Objects (the full tree) from D:\SourceFolder are first sorted
by their last-write-time property and I then create a revision
entry for each date that appears (the revision resolution is
adjustable in the script).  This makes it so that each file
ends up appearing to have been committed on the same date that
it had on the original source file, so checking out the files
with the use-commit-times option gives them the same date as the
original file (if not, necessarily, exactly the same time).
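
As a rough illustration of the grouping step only (this is Python,
not the PowerShell script, which is not shown in this thread), the
sketch below walks a tree and collects files by the UTC calendar date
of their last-write time, oldest date first.  The function name
group_by_write_date is hypothetical, one-day resolution is an
assumption, and directories are left out for brevity.

```python
import os
from collections import defaultdict
from datetime import datetime, timezone


def group_by_write_date(root):
    """Return [(date, [paths])], oldest date first, one entry per day.

    Each entry corresponds to one would-be revision in the dump file.
    Only files are collected; directory handling is omitted here.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            # Use UTC so the grouping does not depend on the local zone.
            day = datetime.fromtimestamp(mtime, tz=timezone.utc).date()
            groups[day].append(path)
    return sorted(groups.items())
```

Each (date, paths) pair would then become one revision whose svn:date
falls on that date, which keeps svn:date sorted across revisions.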

Yippee, it works.

Now to some gritty details, which is why I am here.


Q1:  If, in the dump file, I sometimes give a file a property
svn:eol-style = native, but the file itself has been copied
directly into the dump file (ie. contains CRLF end-of-lines)
is that going to matter to svnadmin load?

[Will the load process take care of things for me or do I
need to parse such files and make them all LF - which is what
svn says it uses internally for "native" files? ]

My experiments seemed to show that svnadmin dump also produced
the CRLF end-of-lines but it all gets quite confusing so I
thought I would ask here.

Since I mostly work under Windows it's probably not a big deal
for me ... but I'd rather the script was correct in case it
gets used by others that may have other requirements.


Q2:  When writing the code to try and identify text versus
binary files I decided to look at what subversion did ... but
now I am confused.  In libsvn_subr\io.c function
svn_io_detect_mimetype2 a comment says:
     going to examine the first block of data, and make sure that 85%
     of the bytes are such that their value is in the ranges 0x07-0x0D
     or 0x20-0x7F, and that 100% of those bytes is not 0x00.
but my reading of this code
      if (((binary_count * 1000) / amt_read) > 850)
        {
          *mimetype = generic_binary;
          return SVN_NO_ERROR;
        }
suggests that it is actually setting the type to binary only
if it finds more than 85% are binary bytes (in earlier code a
file is forced to binary if any null byte is found).

Can anyone explain this?  A bug or am I missing something?
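
To make the discrepancy concrete, here is a minimal Python sketch of
the behaviour as the quoted code reads, not as the comment describes
it.  It assumes that "binary" bytes are exactly those outside the
ranges named in the comment (0x07-0x0D and 0x20-0x7F); the function
name looks_binary and the handling of an empty block are assumptions
of the sketch.

```python
def looks_binary(block: bytes) -> bool:
    """Classify a block the way the quoted test appears to."""
    if not block:
        return False  # assumption: an empty block is not called binary
    if 0 in block:
        return True   # any null byte forces binary
    binary_count = sum(
        1 for b in block
        if b < 0x07 or 0x0D < b < 0x20 or b > 0x7F
    )
    # Same integer arithmetic as the C test: binary only above 85%.
    return (binary_count * 1000) // len(block) > 850
```

Under this reading a block that is half high-bit bytes and contains no
null byte is still treated as text, whereas the comment suggests that
85% of the bytes must be in the text ranges.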


Q3:  If there are already other scripts around that do this
then feel free to tell me that I have wasted my time.  I could
not find any similar solutions in my searching.


Q4:  If there are any powershell people here that would like to
review and test the code I am quite happy to share it ... but
would not recommend it to a scripting novice until it has been
checked over and tested by more than me.


Q5:  I found a description of the dump file in the source but
that description says "Properties are stored in the same
human-readable hashdump format used by working copy property
files,"   Any pointers to a description for that?

(Obviously I've gotten by just by visually checking dump files
produced by svnadmin, but it would be good to know what I was
doing. ;-)


Hmmm... big post for my first post.  Hope that's okay.

-- 
Geoff Worboys
Telesis Computing

Re: Generating a dump file using a powershell script

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Geoff Worboys wrote on Wed, 23 Jun 2010 at 08:33 -0000:
> Well certainly it takes care of the line feeds if you create
> the property svn:eol-style=native at some point after
> committing the original file.  A dump after that commit shows
> the entire file repeated with the new eols ...
> 
> But I've adjusted my script now, any files the script identifies
> (by extension) as needing svn:eol-style=native are appropriately
> translated into the dump file.
> 

OK

> Anyone here interested in looking-at/experimenting-with the
> powershell script let me know, I am happy to share it.  Does
> this list accept attachments or can that be done only via
> private email?
> 

It accepts some kinds of attachments.  (I'm not sure which; don't feel
bad if you have to try twice.  But text/plain should work.)  And if not
by private email, you could always pastebin the script and/or upload it
to your homepage...

Re: Generating a dump file using a powershell script

Posted by Geoff Worboys <ge...@telesiscomputing.com.au>.
Daniel Shahaf wrote:
> svnadmin operates at a level below the sanity checks (it
> talks to libsvn_fs directly most of the time) --- it'll
> load the dumpfile literally.  svn doesn't complain outright,
> okay, and I suspect it may even correct the linefeeds for
> you on the first commit to the file in question.

Well certainly it takes care of the line feeds if you create
the property svn:eol-style=native at some point after
committing the original file.  A dump after that commit shows
the entire file repeated with the new eols ...

But I've adjusted my script now, any files the script identifies
(by extension) as needing svn:eol-style=native are appropriately
translated into the dump file.
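
The translation itself is simple; a minimal Python sketch of the idea
(the PowerShell script is not shown here, and the name to_lf is
hypothetical) is:

```python
def to_lf(data: bytes) -> bytes:
    """Convert CRLF line endings to LF, leaving other bytes untouched."""
    return data.replace(b"\r\n", b"\n")
```

The content length written to the dump file has to be computed from
the translated bytes, not from the original file size.  Lone CR
characters are left alone in this sketch.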

I am quite pleased with the result, it certainly seems to solve
my file time-stamp problem well enough.  Still to do some more
testing but it looks promising.


Anyone here interested in looking-at/experimenting-with the
powershell script let me know, I am happy to share it.  Does
this list accept attachments or can that be done only via
private email?


Thanks again.

-- 
Geoff Worboys
Telesis Computing

Re: Generating a dump file using a powershell script

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Geoff Worboys wrote on Wed, 23 Jun 2010 at 04:12 -0000:
> Daniel Shahaf wrote:
> > i.e., 'svnadmin dump' produces CRLF for svn:eol-style=native
> > files?  That surprises me; I'd expect such files to be
> > outputted with LF in dump files.  (My testing agrees with my
> > expectation.)  Can you double-check?
> 
> > In any case, it probably *should* use LF, since dumpfiles are
> > supposed to be a portable binary format.
> 
> The strange thing, to me, was that while svnadmin load did
> not "correct" the line endings when it loaded the file nor
> did svn seem to corrupt the file when checking out.  (I had
> thought it might create files with CRCRLF or some such.)
> That is not a complaint BTW ;-)
> 

svnadmin operates at a level below the sanity checks (it talks to
libsvn_fs directly most of the time) --- it'll load the dumpfile
literally.  svn doesn't complain outright, okay, and I suspect it may
even correct the linefeeds for you on the first commit to the file in
question.

Re: Generating a dump file using a powershell script

Posted by Geoff Worboys <ge...@telesiscomputing.com.au>.
Daniel Shahaf wrote:
> i.e., you import the files in order of their timestamps, so
> that svn:date remain globally sorted?

> Nice!

Yes, I thought so.  :-)


> i.e., 'svnadmin dump' produces CRLF for svn:eol-style=native
> files?  That surprises me; I'd expect such files to be
> outputted with LF in dump files.  (My testing agrees with my
> expectation.)  Can you double-check?

> In any case, it probably *should* use LF, since dumpfiles are
> supposed to be a portable binary format.

I think you are correct.  I have an odd mix of svn repositories
here, some created by cvs2svn and some created directly by
various versions of svn ... and a few now created from script.

I do have a repository (originally created from cvs2svn) that
does dump files with property svn:eol-style=native but outputs
them with CRLF in the dump files.  Suspect something went
astray there.  I have vague memories of playing with the dump
files back when I created this repository so it may be a
problem that I caused ... or not.

It does appear that svnadmin accepts the dump file as the
literal truth - with minimal validation.  For example I had
originally tried using ISO8601 timestamps on my files, eg:
  2010-10-31T12:34:56+10:00
and svnadmin load built the repository but svn itself ends up
complaining about bogus dates.  Luckily the script was easy
enough to change over to UTC timestamps.
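
For illustration, the conversion amounts to something like the Python
sketch below (not the PowerShell code; the name to_svn_date is
hypothetical).  It produces the UTC form with microseconds and a
trailing Z that svnadmin dump itself writes for svn:date.

```python
from datetime import datetime, timezone


def to_svn_date(iso_text: str) -> str:
    """Convert an ISO 8601 timestamp with a UTC offset to svn:date form."""
    stamp = datetime.fromisoformat(iso_text)
    # Note: a timestamp without an offset would be read as local time.
    utc = stamp.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.%fZ")


print(to_svn_date("2010-10-31T12:34:56+10:00"))
# prints 2010-10-31T02:34:56.000000Z
```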

The strange thing, to me, was that while svnadmin load did
not "correct" the line endings when it loaded the file nor
did svn seem to corrupt the file when checking out.  (I had
thought it might create files with CRCRLF or some such.)
That is not a complaint BTW ;-)


>> Can anyone explain this?  A bug or am I missing something?
>> 

> What's the question?  Are you saying the code/comment disagree?

Yes they disagree.  The question is: Which is right? (or Which
was the original intention?)

I see Bert/Julian have moved that part of the post to the dev
list but I have not subscribed there at this time.  I am
content to leave the decision on how to handle it with the devs,
I just wanted my script to be consistent with svn and wanted
it to automatically identify binary files distinct from text.

I imagine the svn code wants to accept some "binary" bytes in
order to see utf8 files as text ...  but never having analysed
the distribution properties of utf8 I could not guess what
would be likely to work best - but I do know that bytes >0x7F
should be analysed separately to the other control characters.  [If utf8
is not required then I would imagine that any "binary" at all
would indicate the file is not a text file.]


> Internally the function it uses is svn_hash_write2(), and
> there's a small documentation comment at the top of hash.c.
> But, as you say,

>> (Obviously I've gotten by just by visually checking dump
>> files produced by svnadmin, but it would be good to know
>> what I was doing. ;-)
>> 

> the format isn't hard to reverse-engineer, right?

Not difficult ... but there are some subtleties in regard to
whether (and which) new-line characters are part of certain
data counts and trying to make sure my code attaches various
delimiting new-lines to the correct blocks of output ... etc.
If it was purely a text file then many things would be more
obvious but being a mix of binary and text the use of \n
delimiters needs to be careful (and should be explicit).
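
As an example of being explicit about it, here is a minimal Python
sketch of writing one property block in the K/V form seen in dump
files.  The function name props_block is hypothetical.  Each length
counts only the bytes of the key or value; the newline that follows
is a delimiter and is not counted.

```python
def props_block(props: dict) -> bytes:
    """Serialize properties as K/V records terminated by PROPS-END."""
    out = bytearray()
    for key, value in props.items():
        k = key.encode("utf-8")
        v = value.encode("utf-8")
        out += b"K %d\n" % len(k) + k + b"\n"
        out += b"V %d\n" % len(v) + v + b"\n"
    out += b"PROPS-END\n"
    return bytes(out)


block = props_block({"svn:eol-style": "native"})
print(len(block))  # prints 40
```

The Prop-content-length header is then the length of the whole block,
including the "PROPS-END" line and its newline (40 in this example).
Working in bytes throughout avoids any newline translation by the
output layer.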


Thanks for your response, most appreciated.

-- 
Geoff Worboys
Telesis Computing

Re: Generating a dump file using a powershell script

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Geoff Worboys wrote on Tue, 22 Jun 2010 at 17:36 -0000:
> powershell .\Import-from-Source D:\SourceFolder D:\Temp\DumpFile.dat
> 
> It takes the entire contents of D:\SourceFolder and creates
> a subversion dump file in D:\Temp\DumpFile.dat.  It replicates
> the structure inside D:\SourceFolder so if you want a "trunk"
> folder etc you have to have created them first.
> 
> Objects (the full tree) from D:\SourceFolder are first sorted
> by their last-write-time property and I then create a revision
> entry for each date that appears (the revision resolution is
> adjustable in the script).  This makes it so that each file
> ends up appearing to have been committed on the same date that
> it had on the original source file, so checking out the files
> with the use-commit-times option gives them the same date as the
> original file (if not, necessarily, exactly the same time).
> 

i.e., you import the files in order of their timestamps, so that
svn:date remain globally sorted?

Nice!

> Q1:  If, in the dump file, I sometimes give a file a property
> svn:eol-style = native, but the file itself has been copied
> directly into the dump file (ie. contains CRLF end-of-lines)
> is that going to matter to svnadmin load?
> 
> [Will the load process take care of things for me or do I
> need to parse such files and make them all LF - which is what
> svn says it uses internally for "native" files? ]
> 
> My experiments seemed to show that svnadmin dump also produced
> the CRLF end-of-lines but it all gets quite confusing so I
> thought I would ask here.
> 

i.e., 'svnadmin dump' produces CRLF for svn:eol-style=native files?
That surprises me; I'd expect such files to be outputted with LF in dump
files.  (My testing agrees with my expectation.)  Can you double-check?

In any case, it probably *should* use LF, since dumpfiles are supposed
to be a portable binary format.

> Since I mostly work under Windows it's probably not a big deal
> for me ... but I'd rather the script was correct in case it
> gets used by others that may have other requirements.
> 
> 
> Q2:  When writing the code to try and identify text versus
> binary files I decided to look at what subversion did ... but
> now I am confused.  In libsvn_subr\io.c function
> svn_io_detect_mimetype2 a comment says:
>      going to examine the first block of data, and make sure that 85%
>      of the bytes are such that their value is in the ranges 0x07-0x0D
>      or 0x20-0x7F, and that 100% of those bytes is not 0x00.
> but my reading of this code
>       if (((binary_count * 1000) / amt_read) > 850)
>         {
>           *mimetype = generic_binary;
>           return SVN_NO_ERROR;
>         }
> suggests that it is actually setting the type to binary only
> if it finds more than 85% are binary bytes (in earlier code a
> file is forced to binary if any null byte is found).
> 
> Can anyone explain this?  A bug or am I missing something?
> 

What's the question?  Are you saying the code/comment disagree?

> Q5:  I found a description of the dump file in the source but
> that description says "Properties are stored in the same
> human-readable hashdump format used by working copy property
> files,"   Any pointers to a description for that?
> 

You're quoting <http://svn.apache.org/repos/asf/subversion/trunk/notes/dump-load-format.txt>.

Internally the function it uses is svn_hash_write2(), and there's
a small documentation comment at the top of hash.c.  But, as you say,

> (Obviously I've gotten by just by visually checking dump files
> produced by svnadmin, but it would be good to know what I was
> doing. ;-)
> 

the format isn't hard to reverse-engineer, right?

> 
> Hmmm... big post for my first post.  Hope that's okay.
> 
> 

Yeah.  For next time, you could consider adding a one-paragraph summary
at the top, and/or make it clear what kind of responses you're looking
for (e.g., "Hey, I'm looking for people to try my script", or "Hey, I'm
looking for answers to questions I ran into developing a script", or ...)