You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Greg Stein <gs...@lyra.org> on 2003/04/28 22:44:07 UTC

Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn

On Mon, Apr 28, 2003 at 03:27:48PM -0500, kfogel@tigris.org wrote:
>...
> +++ branches/cvs2svn-kfogel/tools/cvs2svn/cvs2svn.py	Mon Apr 28 15:27:44 2003
>...
> +def ensure_directories(path, root, dumpfile):
> +  """Output to DUMPFILE any intermediate directories in PATH that are
> +  not already present under node ROOT, adding them to ROOT's tree as
> +  we go.  Return the last parent directory, that is, the parent
> +  of PATH's basename.  Leading slash(es) on PATH are optional."""
> +  path = path.lstrip('/')

This string method usage requires Python 2.0, whereas cvs2svn.py used to
only require Python 1.5.2. And with the future dropping of the bindings,
then it will be even easier to only do 1.5.2 (while our bindings can work
against 1.5.2, I'm not sure how many people try that).

So... that would change to:

  path = string.lstrip(path, '/')

> +  path_so_far = None
> +  components = string.split(path, '/')

One of the things that I like to do is:

  components = filter(None, string.split(path, '/'))

That filter() usage will filter out all empty strings, which means it
ignores leading and trailing slashes, and double-slashes. Kinda nifty :-)

> +  last_idx = len(components) - 1
> +  this_node = root
> +
> +  i = 0
> +  while (i < last_idx):
> +
> +    component = components[i]

I don't think you need the "i" logic. Just do:

  for component in components:

> +    if path_so_far:
> +      path_so_far += '/' + component

The += construct is also Python 2.0 based.

>...
> +def get_md5(path):
> +  """Return the hex md5 digest of file PATH."""
> +  f = open(path, 'r')
> +  checksum = md5.new()
> +  buf = f.read(102400)
> +  while buf:
> +    checksum.update(buf)
> +    buf = f.read(102400)
> +  f.close()
> +  return checksum.hexdigest()

There isn't a real need to close the file, as it will happen when "f" goes
out of scope.

>...
> +    # Make the dumper's temp directory for this run.  RCS working
> +    # files get checked out into here.
> +    os.mkdir(self.tmpdir)

Couldn't you use 'co -p' and avoid temp files altogether?

>...
> +    # Anything ending in ".1" is a new file.
> +    #
> +    # ### We could also use the parent_node to determine this.
> +    # ### Maybe we should, too, because ".1" is not perfectly
> +    # ### reliable, because of 'cvs commit -r'...
> +    if re.match('.*\\.1$', cvs_rev):

Bah. Perl-style overkill :-)

  if cvs_rev[-2:] == '.1':

>...
> +    self.dumpfile.write('Node-path: %s\n' % svn_path)
> +    self.dumpfile.write('Node-kind: file\n')
> +    self.dumpfile.write('Node-action: %s\n' % action)
> +    self.dumpfile.write('Prop-content-length: %d\n' % 10)  ### svn:executable?
> +    self.dumpfile.write('Text-content-length: %d\n' % os.path.getsize(working))
> +    self.dumpfile.write('Text-content-md5: %s\n' % get_md5(working))
> +    self.dumpfile.write('Content-length: %d\n' % 0) # todo
> +    self.dumpfile.write('\n')
> +    self.dumpfile.write('PROPS-END\n')

Might be interesting to have a little helper function/object that writes out
all this stuff, and can be shared between here and the Node class.

> +    ### This is a pity.  We already ran over all the file's bytes to
> +    ### get the checksum, now we have to do it again to insert the
> +    ### file's contents into the dumpstream?  What a lose.
> +    ###
> +    ### A solution: write '00000000000000000000000000000000' for the
> +    ### initial checksums, then go back and patch them up after the
> +    ### entire dumpfile has been written.  (We'd calculate the
> +    ### checksum as we get each file's contents, record it somewhere,
> +    ### and look it up during the patchup phase.)
> +    ###
> +    ### We could also use the `md5sum' utility to get the checksum,
> +    ### and the OS's file append capability to append the file.  But
> +    ### then we'd have a dependency on `md5sum' (yuck); and how would
> +    ### our open filehandle interact with data being inserted behind
> +    ### its back?  Not very well, I imagine.

Right. This plays into the use of 'co -p', too. Nominally, it would be great
to be able to do:

  f = os.popen('co -p ...')
  checksum = md5.new()
  buf = f.read(CHUNK_SIZE)
  while buf:
    checksum.update(buf)
    dumpfile.write(buf)
    buf = f.read(CHUNK_SIZE)
  digest = checksum.hexdigest()

The problem is going back. You can use dumpfile.tell() to get a seek
position, and then do the backwards seek stuff. Another alternative would be
taking a direction from HTTP and have the notion of "trailing" headers. The
checksum would come in an RFC822-ish "header block" *after* the content.

Oh, and your "patchup phase" wouldn't be that complicated. Just something
like:

  dumpfile.write('Checksum: ')
  pos = dumpfile.tell()
  dumpfile.write(CHECKSUM_PLACEHOLDER + '\n')
  ...
  copy content
  ...
  dumpfile.seek(0, pos)
  dumpfile.write(checksum.hexdigest())
  dumpfile.seek(2, 0)

You could also do:

  import FCNTL
  ...
  dumpfile.seek(FCNTL.SEEK_SET, ...
  ...
  dumpfile.seek(FCNTL.SEEK_END, ...

rather than 0 or 2.

>...

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
brane@xbc.nu writes:
> CVS on Windows creates RCS files with CRLF line endings. And that's
> a royal pain in the gluteus maximus because you can't migrate CVS
> repositories between Unix and Windows servers. It's also not quite
> clear to me what happens to line endings if you have a text file
> stored with -kb set, or what happens with real binary files.

Whoa -- you mean, not only the RCS metadata uses CRLF, but also the
texts of the revisions, even when they don't have -kb set?  That
sounds just plain wrong.  Supposedly RCS stores text files in Unix LF
format always, and -kb of course just stores whatever bytes you hand
it.

> Sounds like an interesting way to get a nervous breakdown, doesn't it?

Sounds like a familiar way to get a nervous breakdown :-(.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn

Posted by br...@xbc.nu.
Quoting Karl Fogel <kf...@newton.ch.collab.net>:

> Branko Ä&#65533;ibej <br...@xbc.nu> writes:
> > >>+def get_md5(path):
> > >>+  """Return the hex md5 digest of file PATH."""
> > >>+  f = open(path, 'r')
> >
> > Crash!
> > 
> > Read files in binary mode... Write files in binary mode...
> > [repeat 50 times]
> > :-)
> 
> Thanks!  I forgot that Python makes that distinction by default.
> 
> Hmmm.  I've fixed all the open calls where it's appropriate, except
> for two cases where we're opening an RCS file, in visit_file() and in
> RevInfoParser.parse_cvs_file().  In the RCS file format, all
> whitespace outside strings is treated the same, which I suppose
> technically means that an RCS file on Windows *could* have either LF
> or CRLF, and could have either on Unix too.  In practice, I don't know
> whether CVS on different operating systems creates the RCS files
> differently, and even if it did, I'm not sure the rcsparse module
> would care, depending on how carefully it parses whitespace.

CVS on Windows creates RCS files with CRLF line endings. And that's a royal pain
in the gluteus maximus because you can't migrate CVS repositories between Unix
and Windows servers. It's also not quite clear to me what happens to line
endings if you have a text file stored with -kb set, or what happens with real
binary files.

Sounds like an interesting way to get a nervous breakdown, doesn't it?


> The easiest thing is probably just to try cvs2svn on Windows and see
> what happens.

"Hmm, what happens if I push the Big Red Button?" :-)

    Brane

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
Branko Čibej <br...@xbc.nu> writes:
> >>+def get_md5(path):
> >>+  """Return the hex md5 digest of file PATH."""
> >>+  f = open(path, 'r')
>
> Crash!
> 
> Read files in binary mode... Write files in binary mode...
> [repeat 50 times]
> :-)

Thanks!  I forgot that Python makes that distinction by default.

Hmmm.  I've fixed all the open calls where it's appropriate, except
for two cases where we're opening an RCS file, in visit_file() and in
RevInfoParser.parse_cvs_file().  In the RCS file format, all
whitespace outside strings is treated the same, which I suppose
technically means that an RCS file on Windows *could* have either LF
or CRLF, and could have either on Unix too.  In practice, I don't know
whether CVS on different operating systems creates the RCS files
differently, and even if it did, I'm not sure the rcsparse module
would care, depending on how carefully it parses whitespace.

The easiest thing is probably just to try cvs2svn on Windows and see
what happens.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org


Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn

Posted by Branko Čibej <br...@xbc.nu>.
>
>
>On Mon, Apr 28, 2003 at 03:27:48PM -0500, kfogel@tigris.org wrote:
>  
>
>>...
>>+def get_md5(path):
>>+  """Return the hex md5 digest of file PATH."""
>>+  f = open(path, 'r')
>>
Crash!

Read files in binary mode... Write files in binary mode...
[repeat 50 times]
:-)


-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
"Jeffrey C. Ollie" <je...@ollie.clive.ia.us> writes:
> The main problem with using os.path.{join|split} is that their behavior
> changes depending on platform.  Directly using posixpath ensures that
> '/' is used to separate directories and not ':' or '\'.  But depending
> on an undocumented module (even one as important as posixpath) probably
> isn't the best either.  Which is why I suggested using custom
> subroutines (potentially cribbed from posixpath).  With some
> subroutines, you can make sure that you don't get doubled slashes, etc.

No, I'm talking about the string module, not os.path.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn

Posted by "Jeffrey C. Ollie" <je...@ollie.clive.ia.us>.
On Tue, 2003-04-29 at 08:54, Karl Fogel wrote:
> "Jeffrey C. Ollie" <je...@ollie.clive.ia.us> writes:
> > Doh! Open mouth, insert foot. Still, there seem to be a lot of places
> > where paths are pulled apart/put together by hand. I think that it would
> > save some headaches later if either the posixpath.* routines (os.path ==
> > posixpath on Unix) or a small set of subroutines were developed to
> > handle all of the path mangling.
> 
> Thanks for the tip, Jeff...  I don't see `posixpath' documented in the
> Python library reference        
> 
>    http://www.python.org/doc/current/lib/lib.html
> 
> (There are `posix' and `posixfile', both of are clearly not to be used
> in a portable script.)  Am I missing something?

posixpath is imported automatically when you import "os" on a Unix
system and is accessible via "os.path".  There are similar modules for
Windows and MacOS systems.  But yes, since posixpath is undocumented
it's hard to justify relying on it.

> I don't yet see any advantage to subroutines to do what split() and
> join() already do, but if our repository path manipulation starts
> getting more complex, then it might be worth it.

The main problem with using os.path.{join|split} is that their behavior
changes depending on platform.  Directly using posixpath ensures that
'/' is used to separate directories and not ':' or '\'.  But depending
on an undocumented module (even one as important as posixpath) probably
isn't the best either.  Which is why I suggested using custom
subroutines (potentially cribbed from posixpath).  With some
subroutines, you can make sure that you don't get doubled slashes, etc.

Jeff



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn

Posted by Branko Čibej <br...@xbc.nu>.
Karl Fogel wrote:

>"Jeffrey C. Ollie" <je...@ollie.clive.ia.us> writes:
>  
>
>>Doh! Open mouth, insert foot. Still, there seem to be a lot of places
>>where paths are pulled apart/put together by hand. I think that it would
>>save some headaches later if either the posixpath.* routines (os.path ==
>>posixpath on Unix) or a small set of subroutines were developed to
>>handle all of the path mangling.
>>    
>>
>
>Thanks for the tip, Jeff...  I don't see `posixpath' documented in the
>Python library reference        
>
Not surprising, as posixpath is just the implementation module for
os.path on POSIX systems. The logic looks a bit like this (in os.py):

    if (platform is posix)
       import posixpath
       os.path = posixpath
       del posixpath
    elif ...

I don't know I'd want to depend on an implementation detail, though.

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
"Jeffrey C. Ollie" <je...@ollie.clive.ia.us> writes:
> Doh! Open mouth, insert foot. Still, there seem to be a lot of places
> where paths are pulled apart/put together by hand. I think that it would
> save some headaches later if either the posixpath.* routines (os.path ==
> posixpath on Unix) or a small set of subroutines were developed to
> handle all of the path mangling.

Thanks for the tip, Jeff...  I don't see `posixpath' documented in the
Python library reference        

   http://www.python.org/doc/current/lib/lib.html

(There are `posix' and `posixfile', both of are clearly not to be used
in a portable script.)  Am I missing something?

I don't yet see any advantage to subroutines to do what split() and
join() already do, but if our repository path manipulation starts
getting more complex, then it might be worth it.

-Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn

Posted by "Jeffrey C. Ollie" <je...@ollie.clive.ia.us>.
On Tue, 2003-04-29 at 00:48, Branko Čibej wrote:
> Jeffrey C. Ollie wrote:
> 
> >One thing that I noticed about the construction of ensure_directories is
> >that it's probably not very portable to Windows, to say nothing about
> >MacOS <= 9
> >
> Those are in-repository paths, not on-disk paths. Repository paths use /
> as the separator.

Doh! Open mouth, insert foot. Still, there seem to be a lot of places
where paths are pulled apart/put together by hand. I think that it would
save some headaches later if either the posixpath.* routines (os.path ==
posixpath on Unix) or a small set of subroutines were developed to
handle all of the path mangling.

Jeff



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn

Posted by Branko Čibej <br...@xbc.nu>.
Jeffrey C. Ollie wrote:

>One thing that I noticed about the construction of ensure_directories is
>that it's probably not very portable to Windows, to say nothing about
>MacOS <= 9
>
Those are in-repository paths, not on-disk paths. Repository paths use /
as the separator.


-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
"Jeffrey C. Ollie" <je...@ollie.clive.ia.us> writes:
> use Unix-style paths instead of using the os.path.* functions.    Also,
> I think that I fixed a buglet.  You had the following:
> 
> if path_so_far:
>   path_so_far = path + '/' + component
> else:
>   path_so_far = component
> 
> Using "path_so_far" rather than "path" in the second line would seem to
> be the proper way to go.  

Yup, that looks like a bug -- nice catch!  Will fix.

Thanks,
-Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn

Posted by "Jeffrey C. Ollie" <je...@ollie.clive.ia.us>.
On Mon, 2003-04-28 at 17:51, Karl Fogel wrote:
> Greg Stein <gs...@lyra.org> writes:
> > > +  path_so_far = None
> > > +  components = string.split(path, '/')
> > 
> > One of the things that I like to do is:
> > 
> >   components = filter(None, string.split(path, '/'))
> > 
> > That filter() usage will filter out all empty strings, which means it
> > ignores leading and trailing slashes, and double-slashes. Kinda nifty :-)
> 
> Ooooh, I like it.

One thing that I noticed about the construction of ensure_directories is
that it's probably not very portable to Windows, to say nothing about
MacOS <= 9 (although that's probably not a big market segment for
cvs2svn!). I also noticed that a lot of other functions are hardcoded to
use Unix-style paths instead of using the os.path.* functions.    Also,
I think that I fixed a buglet.  You had the following:

if path_so_far:
  path_so_far = path + '/' + component
else:
  path_so_far = component

Using "path_so_far" rather than "path" in the second line would seem to
be the proper way to go.  Anyway here's a proposed replacement that uses
os.path.* functions:

def ensure_directories(path, root, dumpfile):
  """Output to DUMPFILE any intermediate directories in PATH that are
  not already present under node ROOT, adding them to ROOT's tree as
  we go.  Return the last parent directory, that is, the parent
  of PATH's basename.  Leading slash(es) on PATH are optional."""

  # split the drive letter off on Windows (Mac & Unix path_so_far will be '')
  # after normalizing the path to get rid of double slashes, etc.
  path_so_far, head = os.path.splitdrive(os.path.normpath(path))

  # build a list of the path's components, in reverse order, which will
  # be used as a stack later
  components = []  
  while head not in ['', os.sep, os.altsep]:
    head, tail = os.path.split(head)
    components.append(tail)

  # get rid of the very last component, which is at the bottom of
  # the stack
  del components[0]

  this_node = root

  while len(components) > 0:
    # pop a component off the stack
    component = components[-1]
    del components[-1]

    # string the paths together
    path_so_far = os.path.join(path_so_far, component)

    if component not in this_node.children:
      this_node.children[component] = Node()
      dumpfile.write("Node-path: %s\n" % path_so_far)
      dumpfile.write("Node-kind: dir\n")
      dumpfile.write("Node-action: add\n")
      dumpfile.write("Prop-content-length: 10\n")
      dumpfile.write("Content-length: 10\n")
      dumpfile.write("\n")
      dumpfile.write("PROPS-END\n")
      dumpfile.write("\n")
      dumpfile.write("\n")
    this_node = this_node.children[component]

  return this_node



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
Branko Čibej <br...@xbc.nu> writes:
> /you should read the Python tutorial, then. :-)

Rather, /me should *remember* what he read in the Python tutorial ages
and ages ago :-).

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org


Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn

Posted by Branko Čibej <br...@xbc.nu>.
Karl Fogel wrote:

>Greg Stein <gs...@lyra.org> writes:
>  
>
>>ah!
>>
>>  for component in components[:-1]:
>>
>>That'll create a new slice of the list, excluding the last component.
>>    
>>
>
>/me sits stunned at the elegance of it all
>  
>
/you should read the Python tutorial, then. :-)

-- 
Brane Čibej   <br...@xbc.nu>   http://www.xbc.nu/brane/


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
Greg Stein <gs...@lyra.org> writes:
> ah!
> 
>   for component in components[:-1]:
> 
> That'll create a new slice of the list, excluding the last component.

/me sits stunned at the elegance of it all

> anydbm should be fine. 'marshal' is just fine and very fast, if you stick
> with native Python datatypes (list, dict, string, integer, etc). You only
> need Pickle if you need to serialize class instances. My understanding is
> that cPickle is nearly as fast as marshal, but I doubt you'll need much
> beyond a "dictionary" that looks like:
> 
>   { "path1": None, "path2": None }
> 
> And you just test with .has_key(path).  (I used None in the example cuz you
> don't need a value(?); just the key)

Thanks!

It's gonna be a little more complicated than that, of course, since in
the Subversion filesystem, deleting a path is just a single call to
delete(PATH), yet means that everything under PATH goes away in head.
Since our mini-database has to represent the head tree at all times,
it needs to make all the paths underneath PATH go away when PATH goes
away -- which was the beauty of the current Node implementation, ah
well, easy come easy go :-).

> Be wary of anydbm. If that defaults to dumbdbm, then you're going to end up
> with the index in memory. Kaboom! Suggestion:
> 
>   import anydbm
>   
>   if anydbm._defaultmod.__name__ == 'dumbdbm':
>     print 'ERROR: your installation of Python does not contain a proper'
>     print '       DBM module. This script cannot continue.'
>     print '       to solve: see blah blah blah'
>     sys.exit(1)
>
> On my RH 7.2 installation, Python has a BDB module. I'd expect that you'll
> be fine with the DBM restriction.

Oh!  I thought dumbdbm was more along the lines of CSV text files or
something.  Yes, let's insist on a real DBM then :-).

-Karl

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn

Posted by Daniel Berlin <db...@dberlin.org>.
> 
>> ...
>> (If you have any suggestions for a particularly Python-ish way to do
>> it, they are welcome of course.  I was just going to use `anydbm' and
>> come up with some simple representation, perhaps involving pickling.)
>
> anydbm should be fine. 'marshal' is just fine and very fast, if you 
> stick
> with native Python datatypes (list, dict, string, integer, etc). You 
> only
> need Pickle if you need to serialize class instances.
or if things are shared, or you have a cyclic structure.

> My understanding is
> that cPickle is nearly as fast as marshal, but I doubt you'll need much
> beyond a "dictionary" that looks like:
>
>   { "path1": None, "path2": None }
>
> And you just test with .has_key(path).  (I used None in the example 
> cuz you
> don't need a value(?); just the key)
>
Just make sure you don't have cyclic data structures, and don't want 
shared instances to still be shared when you unmarshal.

 From the top of marshal:
"/* Write Python objects to files and read them back.
    This is intended for writing and reading compiled Python code only;
    a true persistent storage facility would be much harder, since
    it would have to take circular links and sharing into account. */"
Just to prove the comment is valid:
[dberlin@dberlin Python]$ python
Python 2.2.2 (#1, Feb 24 2003, 19:13:11)
[GCC 3.2.2 20030222 (Red Hat Linux 3.2.2-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
 >>> import marshal
 >>> a={}
 >>> b={}
 >>> a["a"]=b
 >>> b["a"]=a
 >>> marshal.dumps(a)
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
ValueError: object too deeply nested to marshal
 >>> marshal.dumps(b)
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
ValueError: object too deeply nested to marshal
 >>> import pickle
 >>> pickle.dumps(a)
"(dp0\nS'a'\np1\n(dp2\ng1\ng0\nss."
 >>> pickle.dumps(b)
"(dp0\nS'a'\np1\n(dp2\ng1\ng0\nss."
 >>>
> Be wary of anydbm. If that defaults to dumbdbm, then you're going to 
> end up
> with the index in memory. Kaboom! Suggestion:
>
>   import anydbm
>
>   if anydbm._defaultmod.__name__ == 'dumbdbm':
>     print 'ERROR: your installation of Python does not contain a 
> proper'
>     print '       DBM module. This script cannot continue.'
>     print '       to solve: see blah blah blah'
>     sys.exit(1)
>
> On my RH 7.2 installation, Python has a BDB module. I'd expect that 
> you'll
> be fine with the DBM restriction.
Python has multiple BDB modules, last i looked. :)

>
> Cheers,
> -g
>
> -- 
> Greg Stein, http://www.lyra.org/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
> For additional commands, e-mail: dev-help@subversion.tigris.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn

Posted by Greg Stein <gs...@lyra.org>.
On Mon, Apr 28, 2003 at 05:51:44PM -0500, Karl Fogel wrote:
>...
> > So... that would change to:
> > 
> >   path = string.lstrip(path, '/')
> 
> Ah, hmmm.  I was wondering about that.  The above usage is what I
> tried first, but it gave an error in Python 2.2.2... What to do?

Whoops. It errors in 1.5.2, also. lstrip() just removes leading whitespace.
You can't give it other characters to remove.

Ah well. You like the filter() idiom, which does what you want anyways.

>...
> > > +  last_idx = len(components) - 1
> > > +  this_node = root
> > > +
> > > +  i = 0
> > > +  while (i < last_idx):
> > > +
> > > +    component = components[i]
> > 
> > I don't think you need the "i" logic. Just do:
> > 
> >   for component in components:
> 
> That's what I started with, actually -- but I need to know when I hit
> the component right *before* the last component.  I wish loops had a
> 'next_to_last' keyword or something :-).

ah!

  for component in components[:-1]:

That'll create a new slice of the list, excluding the last component.

> > > +    if path_so_far:
> > > +      path_so_far += '/' + component
> > 
> > The += construct is also Python 2.0 based.
> 
> Okay.  Mike Pilato also pointed this out, but I didn't realize at the
> time that we were within spitting distance of 1.5.2 compatibility.
> I'll get rid of the "+=" stuff.

Right. "spitting distance" is the key. Since we are so close, we may as well
stay there. There are a lot of Python installations still on 1.5.2 (for
example, all RedHat distros prior to 8.0).

>...
> The other problem is that Node tree.  Because Subversion uses paths
> for branches, the full size of the node tree is much greater than just
> the tree one would get from

Woah! Forgot about that. Yah... can't keep that in memory.

>...
> (If you have any suggestions for a particularly Python-ish way to do
> it, they are welcome of course.  I was just going to use `anydbm' and
> come up with some simple representation, perhaps involving pickling.)

anydbm should be fine. 'marshal' is just fine and very fast, if you stick
with native Python datatypes (list, dict, string, integer, etc). You only
need Pickle if you need to serialize class instances. My understanding is
that cPickle is nearly as fast as marshal, but I doubt you'll need much
beyond a "dictionary" that looks like:

  { "path1": None, "path2": None }

And you just test with .has_key(path).  (I used None in the example cuz you
don't need a value(?); just the key)

Be wary of anydbm. If that defaults to dumbdbm, then you're going to end up
with the index in memory. Kaboom! Suggestion:

  import anydbm
  
  if anydbm._defaultmod.__name__ == 'dumbdbm':
    print 'ERROR: your installation of Python does not contain a proper'
    print '       DBM module. This script cannot continue.'
    print '       to solve: see blah blah blah'
    sys.exit(1)

On my RH 7.2 installation, Python has a BDB module. I'd expect that you'll
be fine with the DBM restriction.

Cheers,
-g

-- 
Greg Stein, http://www.lyra.org/

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org

Re: svn commit: rev 5750 - branches/cvs2svn-kfogel/tools/cvs2svn

Posted by Karl Fogel <kf...@newton.ch.collab.net>.
Greg Stein <gs...@lyra.org> writes:
> This string method usage requires Python 2.0, whereas cvs2svn.py used to
> only require Python 1.5.2. And with the future dropping of the bindings,
> then it will be even easier to only do 1.5.2 (while our bindings can work
> against 1.5.2, I'm not sure how many people try that).
> 
> So... that would change to:
> 
>   path = string.lstrip(path, '/')

Ah, hmmm.  I was wondering about that.  The above usage is what I
tried first, but it gave an error in Python 2.2.2... What to do?

     $ python
     Python 2.2.2 (#1, Feb 19 2003, 12:12:23) 
     [GCC 2.96 20000731 (Red Hat Linux 7.0)] on linux2
     Type "help", "copyright", "credits" or "license" for more information.
     >>> import string
     >>> path = "/path/to/somewhere"
     >>> path = string.lstrip(path, '/')
     Traceback (most recent call last):
       File "<stdin>", line 1, in ?
     TypeError: lstrip() takes exactly 1 argument (2 given)
     >>> path = path.lstrip('/')
     >>> path
     'path/to/somewhere'
     >>> 
     $ 

> > +  path_so_far = None
> > +  components = string.split(path, '/')
> 
> One of the things that I like to do is:
> 
>   components = filter(None, string.split(path, '/'))
> 
> That filter() usage will filter out all empty strings, which means it
> ignores leading and trailing slashes, and double-slashes. Kinda nifty :-)

Ooooh, I like it.

> > +  last_idx = len(components) - 1
> > +  this_node = root
> > +
> > +  i = 0
> > +  while (i < last_idx):
> > +
> > +    component = components[i]
> 
> I don't think you need the "i" logic. Just do:
> 
>   for component in components:

That's what I started with, actually -- but I need to know when I hit
the component right *before* the last component.  I wish loops had a
'next_to_last' keyword or something :-).

> > +    if path_so_far:
> > +      path_so_far += '/' + component
> 
> The += construct is also Python 2.0 based.

Okay.  Mike Pilato also pointed this out, but I didn't realize at the
time that we were within spitting distance of 1.5.2 compatibility.
I'll get rid of the "+=" stuff.

> > +def get_md5(path):
> > +  """Return the hex md5 digest of file PATH."""
> > +  f = open(path, 'r')
> > +  checksum = md5.new()
> > +  buf = f.read(102400)
> > +  while buf:
> > +    checksum.update(buf)
> > +    buf = f.read(102400)
> > +  f.close()
> > +  return checksum.hexdigest()
> 
> There isn't a real need to close the file, as it will happen when "f" goes
> out of scope.

Thanks.

> >...
> > +    # Make the dumper's temp directory for this run.  RCS working
> > +    # files get checked out into here.
> > +    os.mkdir(self.tmpdir)
> 
> Couldn't you use 'co -p' and avoid temp files altogether?

Right now, it needs the checksum before it writes the file data to the
dumpfile (see the comment above that area), since the checksums go in
the headers.  In the plan to backpatch the checksums, then yes, it
could become a stream again.

[Oh wait, I just saw that you address that issue a bit further down...]

> >...
> > +    # Anything ending in ".1" is a new file.
> > +    #
> > +    # ### We could also use the parent_node to determine this.
> > +    # ### Maybe we should, too, because ".1" is not perfectly
> > +    # ### reliable, because of 'cvs commit -r'...
> > +    if re.match('.*\\.1$', cvs_rev):
> 
> Bah. Perl-style overkill :-)
> 
>   if cvs_rev[-2:] == '.1':

Heh.  Will remember (although this logic may be about to go away).

> > +    self.dumpfile.write('Node-path: %s\n' % svn_path)
> > +    self.dumpfile.write('Node-kind: file\n')
> > +    self.dumpfile.write('Node-action: %s\n' % action)
> > +    self.dumpfile.write('Prop-content-length: %d\n' % 10)  ### svn:executable?
> > +    self.dumpfile.write('Text-content-length: %d\n' % os.path.getsize(working))
> > +    self.dumpfile.write('Text-content-md5: %s\n' % get_md5(working))
> > +    self.dumpfile.write('Content-length: %d\n' % 0) # todo
> > +    self.dumpfile.write('\n')
> > +    self.dumpfile.write('PROPS-END\n')
> 
> Might be interesting to have a little helper function/object that writes out
> all this stuff, and can be shared between here and the Node class.

Yeah, not sure yet what level of factorization will be appropriate,
but will keep an eye out.

> Right. This plays into the use of 'co -p', too. Nominally, it would be great
> to be able to do:
> 
>   f = os.popen('co -p ...')
>   checksum = md5.new()
>   buf = f.read(CHUNK_SIZE)
>   while buf:
>     checksum.update(buf)
>     dumpfile.write(buf)
>     buf = f.read(CHUNK_SIZE)
>   digest = checksum.hexdigest()
> 
> The problem is going back. You can use dumpfile.tell() to get a seek
> position, and then do the backwards seek stuff. Another alternative would be
> taking a direction from HTTP and have the notion of "trailing" headers. The
> checksum would come in an RFC822-ish "header block" *after* the content.
> 
> Oh, and your "patchup phase" wouldn't be that complicated. Just something
> like:
> 
>   dumpfile.write('Checksum: ')
>   pos = dumpfile.tell()
>   dumpfile.write(CHECKSUM_PLACEHOLDER + '\n')
>   ...
>   copy content
>   ...
>   dumpfile.seek(0, pos)
>   dumpfile.write(checksum.hexdigest())
>   dumpfile.seek(2, 0)
> 
> You could also do:
> 
>   import FCNTL
>   ...
>   dumpfile.seek(FCNTL.SEEK_SET, ...
>   ...
>   dumpfile.seek(FCNTL.SEEK_END, ...
> 
> rather than 0 or 2.

Zing!  You have solved one of the two big problems, sir.  Thank you,
I'll do the backpatching in real time.

The other problem is that Node tree.  Because Subversion uses paths
for branches, the full size of the node tree is much greater than just
the tree one would get from

   $ ls -R cvs_repository/

Rather, it's that repository tree multiplied (more or less) by the
number of unique branches and tags in the RCS files.  That's biiig.
So farming the Node tree out to disk is not just a "nice to have"
anymore, it's a requirement for graduation :-).

(If you have any suggestions for a particularly Python-ish way to do
it, they are welcome of course.  I was just going to use `anydbm' and
come up with some simple representation, perhaps involving pickling.)

Thanks for the quick review, especially so early in the rewrite,
-K


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@subversion.tigris.org
For additional commands, e-mail: dev-help@subversion.tigris.org