You are viewing a plain text version of this content. The canonical link for it is here.

Posted to mod_python-dev@quetz.apache.org by Alexis Marrero <am...@mitre.org> on 2005/11/06 13:06:00 UTC

mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

All,

The current 3.1 mod_python implementation of  
mod_python.util.StorageField.read_to_boudary reads as follows:

    203      def read_to_boundary(self, req, boundary, file):
    204          delim = ""
    205          line = req.readline()
    206          sline = line.strip()
    207          last_bound = boundary + "--"
    208          while line and sline != boundary and sline !=  
last_bound:
    209              odelim = delim
    210              if line[-2:] == "\r\n":
    211                  delim = "\r\n"
    212                  line = line[:-2]
    213              elif line[-1:] == "\n":
    214                  delim = "\n"
    215                  line = line[:-1]
    216              file.write(odelim + line)
    217              line = req.readline()
    218              sline = line.strip()

As we have discussed previously:
http://www.modpython.org/pipermail/mod_python/2005-March/017754.html
http://www.modpython.org/pipermail/mod_python/2005-March/017756.html
http://www.modpython.org/pipermail/mod_python/2005-November/019460.html

This triggered couple of changes in mod_python 3.2 Beta which reads  
as follows:
     33  # Fixes memory error when upload large files such as 700+MB  
ISOs.
     34  readBlockSize = 65368
     35
...
    225     def read_to_boundary(self, req, boundary, file):
...
    234         delim = ''
    235         lastCharCarried = False
    236         last_bound = boundary + '--'
    237         roughBoundaryLength = len(last_bound) + 128
    238         line = req.readline(readBlockSize)
    239         lineLength = len(line)
    240         if lineLength < roughBoundaryLength:
    241             sline = line.strip()
    242         else:
    243             sline = ''
    244         while lineLength > 0 and sline != boundary and sline ! 
= last_bound:
    245             if not lastCharCarried:
    246                 file.write(delim)
    247                 delim = ''
    248             else:
    249                 lastCharCarried = False
    250             cutLength = 0
    251             if lineLength == readBlockSize:
    252                 if line[-1:] == '\r':
    253                     delim = '\r'
    254                     cutLength = -1
    255                     lastCharCarried = True
    256             if line[-2:] == '\r\n':
    257                 delim += '\r\n'
    258                 cutLength = -2
    259             elif line[-1:] == '\n':
    260                 delim += '\n'
    261                 cutLength = -1
    262             if cutLength != 0:
    263                 file.write(line[:cutLength])
    264             else:
    265                 file.write(line)
    266             line = req.readline(readBlockSize)
    267             lineLength = len(line)
    268             if lineLength < roughBoundaryLength:
    269                 sline = line.strip()
    270             else:
    271                 sline = ''

This function has a mysterious bug in it... For some files which I  
could disclose (one of them been the PDF file for Apple's Pages User  
Manual in Italian) the uploaded file in the server ends up with the  
same length but different sha512 (the only digest that I'm using).   
The problem is a '\r' in the middle of a chunk of data that is much  
larger than readBlockSize.

Anyhow, I wrote a new function, which I believe is much simpler, and  
test it with thousands and thousands of different files and so far it  
seems to work fine.  It reads as follows:

def read_to_boundary(self, req, boundary, file):
     ''' read from the request object line by line with a maximum size,
         until the new line starts with boundary
     '''
     previous_delimiter = ''
     while 1:
         line = req.readline(1<<16)
         if line.startswith(boundary):
             break

         if line.endswith('\r\n'):
             file.write(previous_delimiter + line[:-2])
             previous_delimiter = '\r\n'

         elif line.endswith('\r') or line.endswith('\n'):
             file.write(previous_delimiter + line[:-1])
             previous_delimiter = line[-1:]

         else:
             file.write(previous_delimiter + line)
             previous_delimiter = ''

Let me know any comments on it and if you test it and fails please  
also let me know. I don't have subversion account neither I don't  
know how to use it thus this email.

/amn

_______________________________________________
Mod_python mailing list
Mod_python@modpython.org
http://mailman.modpython.org/mailman/listinfo/mod_python

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Jim Gallacher <jp...@jgassociates.ca>.

I've been spending some quality time with hexedit, vim and a little bit 
of python. I can now generate a file which can be used in the unit test.

The problem seems to occur when a '\r' character is right at 
readBlockSize boundary, which is 65368 in the current mod_python.util.

I have not yet tested Alexis's fix yet. It's possible that it doesn't 
fix the problem but is just masking it since his code uses a 
readBlockSize of 65536, which will not be a problem for ugh.pdf.

The attached script will generate a file that can be used for testing. I 
haven't tried to integrate it into Nicolas's unit test as I wanted to 
share my findings with everyone asap. I'll spend some more time on this 
tommorrow.

Happy Bug Hunting,
Jim

Nicolas Lehuen wrote:
> Hi guys,
> 
> In the pure "if it ain't tested, it ain't fixed" fashion, I've added a 
> unit test for file upload to the test suite. It uploads a randomly 
> generated 1 MB file to the server, and check that the MD5 digest 
> returned by the server is correct. I could not reproduce Alexis' bug 
> report this way, however. But I then I added a test with the UNIX-HATERS 
> handbook file ugh.pdf, and bang, here comes the bug.
> 
> I've checked in both unit tests into subversion, so that you can play 
> with them. I'm now going to test Alexis' fix.
> 
>

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Nicolas Lehuen <ni...@gmail.com>.

OK, this time I think the file upload problem is solved for good. I've
checked-in Alexis's code, with comments. Then I've done a quick
rewrite of the multipart/form-data parser found in
FieldStorage.__init__ and read_to_boundary so that it uses a regexp
for the boundary checks, with the hope that it simplify the code a
little bit (and remove thos nasty strip() calls). I've re-ran all
tests and everything seems OK. Now, Alexis, if you can run your test
with the 60,000+ files on this version of util.py, that would be great
:

http://svn.apache.org/viewcvs.cgi/httpd/mod_python/trunk/lib/python/mod_python/util.py?rev=332779&view=markup

Ironically, the JIRA server seems non responsive right now, so I can't
comment or close MODPYTHON-40...

Regards,
Nicolas

2005/11/12, Nicolas Lehuen <ni...@gmail.com>:
> Guys, next time please use the JIRA bug tracking application available at :
>
> http://nagoya.apache.org/jira/browse/MODPYTHON
>
> Especially this bug report :
>
> http://issues.apache.org/jira/browse/MODPYTHON-40
>
> I'm currently re-reading the whole thread and trying to make sense of
> all the test files and patches that were submitted, and it's quite
> unpleasant, especially since I'm using Google mail which tries to be
> too clever about attachments.
>
> Using JIRA, we have a permanent tracking of what's happening AND we
> can store attachments in a nice way.
>
> Regards,
> Nicolas
>
> 2005/11/8, Alexis Marrero <am...@mitre.org>:
> > Inspired by Mike's changes I made some changes the "new" version to
> > improve performance while keeping readability:
> >
> > def read_to_boundary_new(self, req, boundary, file, readBlockSize):
> >      previous_delimiter = ''
> >      bound_length = len(boundary)
> >      while 1:
> >          line = req.readline(readBlockSize)
> >          if line[:bound_length] == boundary:
> >              break
> >
> >          if line[-2:] == '\r\n':
> >              file.write(previous_delimiter + line[:-2])
> >              previous_delimiter = '\r\n'
> >
> >          elif line[-1:] == '\r':
> >              file.write(previous_delimiter + line[:-1])
> >              previous_delimiter = '\r'
> >
> >          elif line[-1:] == '\n':
> >              if len(line[:-1]) > 0:
> >                  file.write(previous_delimiter + line[:-1])
> >                  previous_delimiter = '\n'
> >
> >              else:
> >                  previous_delimiter += '\n'
> >
> >          else:
> >              file.write(previous_delimiter + line)
> >              previous_delimiter = ''
> >
> > Results are very comparable to read_to_boundary_mike
> >
> > /amn
> > On Nov 8, 2005, at 9:07 AM, Jim Gallacher wrote:
> >
> > > Mike Looijmans wrote:
> > >
> > >> I've attached a modified upload_test_harness.py that includes the
> > >> new and current, also the 'org' version (as in 3.1 release) and
> > >> the 'mike' version.
> > >>
> > >
> > > Nice changes, Mike.
> > >
> > > I started to get confused by the names of the various
> > > read_to_boundary_* functions, so I've made a slight modification to
> > > your code and renamed these functions.
> > >
> > > Note also that there are some failures for the split_boundary file
> > > using offset -2 chunk [CR], so I've change the default offset to
> > > include this condition.
> > >
> > > I've also changed the generate_*file functions to embed an
> > > additional '\r' in a short string. Just being paranoid I guess.
> > >
> > > Diff is attached.
> > >
> > > If it's useful I can stick the upload_test_harness code in the
> > > mod_python svn repository.
> > >
> > > Jim
> > >
> > > --- upload_test_harness_mike.py.orig    2005-11-08
> > > 08:02:13.000000000 -0500
> > > +++ upload_test_harness_mike.py    2005-11-08 08:54:15.000000000 -0500
> > > @@ -14,8 +14,8 @@
> > >  ##    f.write('b'*block_size)
> > >  ##    f.close()
> > >
> > > -def read_to_boundary_current(self, req, boundary, file,
> > > readBlockSize):
> > > -    ''' currrent version '''
> > > +def read_to_boundary_324(self, req, boundary, file, readBlockSize):
> > > +    ''' currrent version  in 3.2.4b'''
> > >      #
> > >      # Although technically possible for the boundary to be split
> > > by the read, this will
> > >      # not happen because the readBlockSize is set quite high - far
> > > longer than any boundary line
> > > @@ -66,7 +66,7 @@
> > >          else:
> > >              sline = ''
> > >
> > > -def read_to_boundary_new(self, req, boundary, file, readBlockSize):
> > > +def read_to_boundary_alexis(self, req, boundary, file,
> > > readBlockSize):
> > >      ''' Alexis' version
> > >          read from the request object line by line with a maximum
> > > size,
> > >          until the new line starts with boundary
> > > @@ -89,7 +89,7 @@
> > >              file.write(previous_delimiter + line)
> > >              previous_delimiter = ''
> > >
> > > -def read_to_boundary_org(self, req, boundary, file, readBlockSize):
> > > +def read_to_boundary_314(self, req, boundary, file, readBlockSize):
> > >      delim = ""
> > >      line = req.readline(readBlockSize)
> > >      while line and not line.startswith(boundary):
> > > @@ -145,6 +145,8 @@
> > >
> > >      f = open(fname, 'wb')
> > >      f.write('a'*50)
> > > +    f.write('\r')
> > > +    f.write('a'*50)
> > >      f.write('\r\n')
> > >
> > >      block_size =  readBlockSize + offset
> > > @@ -163,6 +165,8 @@
> > >      """
> > >      f = open(fname, 'wb')
> > >      f.write('a'*50)
> > > +    f.write('\r')
> > > +    f.write('a'*50)
> > >      f.write('\r\n')
> > >
> > >      block_size =  readBlockSize + offset
> > > @@ -171,7 +175,7 @@
> > >
> > >      f.close()
> > >
> > > -read_boundaries = [read_to_boundary_current, read_to_boundary_new,
> > > read_to_boundary_org, read_to_boundary_mike]
> > > +read_boundaries = [read_to_boundary_324, read_to_boundary_alexis,
> > > read_to_boundary_314, read_to_boundary_mike]
> > >
> > >  def main(file_generator, offset, chunk, block_size=1<<16):
> > >      fname_in = 'testfile.in'
> > > @@ -218,6 +222,9 @@
> > >          pass
> > >
> > >  if __name__ == '__main__':
> > > +    # offsets which show problems are in [-2, -1]
> > > +    first_offset = -2
> > > +    last_offset = 0
> > >
> > >      #test_chunks =  ['', '\r', '\n', '\r\n']
> > >
> > > @@ -229,7 +236,7 @@
> > >          print '='*40
> > >          print file_gen_obj.__name__
> > >          for chunk in test_chunks:
> > > -            for i in range(-1, 0):
> > > +            for i in range(first_offset, last_offset):
> > >                  print '-'*40
> > >                  print 'test offset', i, 'chunk',[ cname(c) for c
> > > in chunk ]
> > >                  print '-'*40
> > >
> >
> >
>

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Nicolas Lehuen <ni...@gmail.com>.

Guys, next time please use the JIRA bug tracking application available at :

http://nagoya.apache.org/jira/browse/MODPYTHON

Especially this bug report :

http://issues.apache.org/jira/browse/MODPYTHON-40

I'm currently re-reading the whole thread and trying to make sense of
all the test files and patches that were submitted, and it's quite
unpleasant, especially since I'm using Google mail which tries to be
too clever about attachments.

Using JIRA, we have a permanent tracking of what's happening AND we
can store attachments in a nice way.

Regards,
Nicolas

2005/11/8, Alexis Marrero <am...@mitre.org>:
> Inspired by Mike's changes I made some changes the "new" version to
> improve performance while keeping readability:
>
> def read_to_boundary_new(self, req, boundary, file, readBlockSize):
>      previous_delimiter = ''
>      bound_length = len(boundary)
>      while 1:
>          line = req.readline(readBlockSize)
>          if line[:bound_length] == boundary:
>              break
>
>          if line[-2:] == '\r\n':
>              file.write(previous_delimiter + line[:-2])
>              previous_delimiter = '\r\n'
>
>          elif line[-1:] == '\r':
>              file.write(previous_delimiter + line[:-1])
>              previous_delimiter = '\r'
>
>          elif line[-1:] == '\n':
>              if len(line[:-1]) > 0:
>                  file.write(previous_delimiter + line[:-1])
>                  previous_delimiter = '\n'
>
>              else:
>                  previous_delimiter += '\n'
>
>          else:
>              file.write(previous_delimiter + line)
>              previous_delimiter = ''
>
> Results are very comparable to read_to_boundary_mike
>
> /amn
> On Nov 8, 2005, at 9:07 AM, Jim Gallacher wrote:
>
> > Mike Looijmans wrote:
> >
> >> I've attached a modified upload_test_harness.py that includes the
> >> new and current, also the 'org' version (as in 3.1 release) and
> >> the 'mike' version.
> >>
> >
> > Nice changes, Mike.
> >
> > I started to get confused by the names of the various
> > read_to_boundary_* functions, so I've made a slight modification to
> > your code and renamed these functions.
> >
> > Note also that there are some failures for the split_boundary file
> > using offset -2 chunk [CR], so I've change the default offset to
> > include this condition.
> >
> > I've also changed the generate_*file functions to embed an
> > additional '\r' in a short string. Just being paranoid I guess.
> >
> > Diff is attached.
> >
> > If it's useful I can stick the upload_test_harness code in the
> > mod_python svn repository.
> >
> > Jim
> >
> > --- upload_test_harness_mike.py.orig    2005-11-08
> > 08:02:13.000000000 -0500
> > +++ upload_test_harness_mike.py    2005-11-08 08:54:15.000000000 -0500
> > @@ -14,8 +14,8 @@
> >  ##    f.write('b'*block_size)
> >  ##    f.close()
> >
> > -def read_to_boundary_current(self, req, boundary, file,
> > readBlockSize):
> > -    ''' currrent version '''
> > +def read_to_boundary_324(self, req, boundary, file, readBlockSize):
> > +    ''' currrent version  in 3.2.4b'''
> >      #
> >      # Although technically possible for the boundary to be split
> > by the read, this will
> >      # not happen because the readBlockSize is set quite high - far
> > longer than any boundary line
> > @@ -66,7 +66,7 @@
> >          else:
> >              sline = ''
> >
> > -def read_to_boundary_new(self, req, boundary, file, readBlockSize):
> > +def read_to_boundary_alexis(self, req, boundary, file,
> > readBlockSize):
> >      ''' Alexis' version
> >          read from the request object line by line with a maximum
> > size,
> >          until the new line starts with boundary
> > @@ -89,7 +89,7 @@
> >              file.write(previous_delimiter + line)
> >              previous_delimiter = ''
> >
> > -def read_to_boundary_org(self, req, boundary, file, readBlockSize):
> > +def read_to_boundary_314(self, req, boundary, file, readBlockSize):
> >      delim = ""
> >      line = req.readline(readBlockSize)
> >      while line and not line.startswith(boundary):
> > @@ -145,6 +145,8 @@
> >
> >      f = open(fname, 'wb')
> >      f.write('a'*50)
> > +    f.write('\r')
> > +    f.write('a'*50)
> >      f.write('\r\n')
> >
> >      block_size =  readBlockSize + offset
> > @@ -163,6 +165,8 @@
> >      """
> >      f = open(fname, 'wb')
> >      f.write('a'*50)
> > +    f.write('\r')
> > +    f.write('a'*50)
> >      f.write('\r\n')
> >
> >      block_size =  readBlockSize + offset
> > @@ -171,7 +175,7 @@
> >
> >      f.close()
> >
> > -read_boundaries = [read_to_boundary_current, read_to_boundary_new,
> > read_to_boundary_org, read_to_boundary_mike]
> > +read_boundaries = [read_to_boundary_324, read_to_boundary_alexis,
> > read_to_boundary_314, read_to_boundary_mike]
> >
> >  def main(file_generator, offset, chunk, block_size=1<<16):
> >      fname_in = 'testfile.in'
> > @@ -218,6 +222,9 @@
> >          pass
> >
> >  if __name__ == '__main__':
> > +    # offsets which show problems are in [-2, -1]
> > +    first_offset = -2
> > +    last_offset = 0
> >
> >      #test_chunks =  ['', '\r', '\n', '\r\n']
> >
> > @@ -229,7 +236,7 @@
> >          print '='*40
> >          print file_gen_obj.__name__
> >          for chunk in test_chunks:
> > -            for i in range(-1, 0):
> > +            for i in range(first_offset, last_offset):
> >                  print '-'*40
> >                  print 'test offset', i, 'chunk',[ cname(c) for c
> > in chunk ]
> >                  print '-'*40
> >
>
>

Re: More efficient StringField and Field classes

Posted by Nicolas Lehuen <ni...@gmail.com>.

I've ran the unit test on Windows with your patch and everything seems fine.

Regards,
Nicolas

2005/11/25, Mike Looijmans <nl...@natlab.research.philips.com>:
> Is there a way that I can run the unit tests on a Windows systems? I
> have plenty of Linux boxes here, but I'd have to build and install
> apache in a "private" corner just to run the tests.
>
> Anyway, I see where the problem is, in tests.py it does:
>
> def util_fieldstorage(req):
>      from mod_python import util
>      req.write(`util.FieldStorage(req).list`)
>      return apache.OK
>
> So I'll just add a __repr__ function to the StringField class and the
> test should pass then.
>
> I have attached a patch (on the 3.2.5b version of util.py) to include
> this change. That should fix the unit test, and it did not break any of
> my own code.
>
>
> Jim Gallacher wrote:
> > Hi Mike,
> >
> > I don't have time to dig into this tonight, but your patch causes one of
> > the unit tests to fail.
> >
> > FAIL: test_util_fieldstorage (__main__.PerRequestTestCase)
> > ----------------------------------------------------------------------
> > Traceback (most recent call last):
> >   File "test.py", line 1117, in test_util_fieldstorage
> >     self.fail(`rsp`)
> >   File "/usr/lib/python2.3/unittest.py", line 270, in fail
> >     raise self.failureException, msg
> > AssertionError: "['1', '2', '3', '4']"
> >
> >
> > Jim
> >
>
>
> --
> Mike Looijmans
> Philips Natlab / Topic Automation
>
>
> Index: util.py
> ===================================================================
> --- util.py     (revision 348746)
> +++ util.py     (working copy)
> @@ -48,19 +48,8 @@
>  """
>
>  class Field:
> -   def __init__(self, name, file, ctype, type_options,
> -                disp, disp_options, headers = {}):
> +   def __init__(self, name):
>         self.name = name
> -       self.file = file
> -       self.type = ctype
> -       self.type_options = type_options
> -       self.disposition = disp
> -       self.disposition_options = disp_options
> -       if disp_options.has_key("filename"):
> -           self.filename = disp_options["filename"]
> -       else:
> -           self.filename = None
> -       self.headers = headers
>
>     def __repr__(self):
>         """Return printable representation."""
> @@ -81,13 +70,34 @@
>         self.file.close()
>
>  class StringField(str):
> -   """ This class is basically a string with
> -   a value attribute for compatibility with std lib cgi.py
> -   """
> +    """ This class is basically a string with
> +    added attributes for compatibility with std lib cgi.py. Basically, this
> +    works the opposite of Field, as it stores its data in a string, but creates
> +    a file on demand. Field creates a value on demand and stores data in a file.
> +    """
> +    filename = None
> +    headers = {}
> +    ctype = "text/plain"
> +    type_options = {}
> +    disposition = None
> +    disp_options = None
> +
> +    # I wanted __init__(name, value) but that does not work (apparently, you
> +    # cannot subclass str with a constructor that takes >1 argument)
> +    def __init__(self, value):
> +        '''Create StringField instance. You'll have to set name yourself.'''
> +        str.__init__(self, value)
> +        self.value = value
>
> -   def __init__(self, str=""):
> -       str.__init__(self, str)
> -       self.value = self.__str__()
> +    def __getattr__(self, name):
> +        if name != 'file':
> +            raise AttributeError, name
> +        self.file = cStringIO.StringIO(self.value)
> +        return self.file
> +
> +    def __repr__(self):
> +       """Return printable representation (to pass unit tests)."""
> +       return "Field(%s, %s)" % (`self.name`, `self.value`)
>
>  class FieldStorage:
>
> @@ -103,8 +113,7 @@
>         if req.args:
>             pairs = parse_qsl(req.args, keep_blank_values)
>             for pair in pairs:
> -               file = cStringIO.StringIO(pair[1])
> -               self.list.append(Field(pair[0], file, "text/plain", {}, None, {}))
> +               self.add_field(pair[0], pair[1])
>
>         if req.method != "POST":
>             return
> @@ -123,9 +132,7 @@
>         if ctype.startswith("application/x-www-form-urlencoded"):
>             pairs = parse_qsl(req.read(clen), keep_blank_values)
>             for pair in pairs:
> -               # TODO : isn't this a bit heavyweight just for form fields ?
> -               file = cStringIO.StringIO(pair[1])
> -               self.list.append(Field(pair[0], file, "text/plain", {}, None, {}))
> +               self.add_field(pair[0], pair[1])
>             return
>
>         if not ctype.startswith("multipart/"):
> @@ -205,33 +212,44 @@
>             else:
>                 name = None
>
> +           # create a file object
>             # is this a file?
>             if disp_options.has_key("filename"):
>                 if file_callback and callable(file_callback):
>                     file = file_callback(disp_options["filename"])
>                 else:
> -                   file = self.make_file()
> +                   file = tempfile.TemporaryFile("w+b")
>             else:
>                 if field_callback and callable(field_callback):
>                     file = field_callback()
>                 else:
> -                   file = self.make_field()
> +                   file = cStringIO.StringIO()
>
>             # read it in
> -           end_of_stream = self.read_to_boundary(req, boundary, file)
> +           self.read_to_boundary(req, boundary, file)
>             file.seek(0)
> -
> +
>             # make a Field
> -           field = Field(name, file, ctype, type_options, disp, disp_options, headers)
> -
> +           if disp_options.has_key("filename"):
> +               field = Field(name)
> +               field.filename = disp_options["filename"]
> +           else:
> +               field = StringField(file.read())
> +               field.name = name
> +           field.file = file
> +           field.type = ctype
> +           field.type_options = type_options
> +           field.disposition = disp
> +           field.disposition_options = disp_options
> +           field.headers = headers
>             self.list.append(field)
>
> -   def make_file(self):
> -       return tempfile.TemporaryFile("w+b")
> +   def add_field(self, key, value):
> +       """Insert a field as key/value pair"""
> +       item = StringField(value)
> +       item.name = key
> +       self.list.append(item)
>
> -   def make_field(self):
> -       return cStringIO.StringIO()
> -
>     def read_to_boundary(self, req, boundary, file):
>          previous_delimiter = None
>          while True:
> @@ -288,11 +306,7 @@
>         found = []
>         for item in self.list:
>             if item.name == key:
> -               if isinstance(item.file, FileType) or \
> -                      isinstance(getattr(item.file, 'file', None), FileType):
> -                   found.append(item)
> -               else:
> -                   found.append(StringField(item.value))
> +               found.append(item)
>         if not found:
>             raise KeyError, key
>         if len(found) == 1:
> @@ -333,11 +347,7 @@
>         """ return the first value received """
>         for item in self.list:
>             if item.name == key:
> -               if isinstance(item.file, FileType) or \
> -                      isinstance(getattr(item.file, 'file', None), FileType):
> -                   return item
> -               else:
> -                   return StringField(item.value)
> +               return item
>         return default
>
>     def getlist(self, key):
> @@ -347,11 +357,7 @@
>         found = []
>         for item in self.list:
>             if item.name == key:
> -               if isinstance(item.file, FileType) or \
> -                      isinstance(getattr(item.file, 'file', None), FileType):
> -                   found.append(item)
> -               else:
> -                   found.append(StringField(item.value))
> +               found.append(item)
>         return found
>
>  def parse_header(line):
>
>
>

Re: More efficient StringField and Field classes

Posted by Mike Looijmans <nl...@natlab.research.philips.com>.

Is there a way that I can run the unit tests on a Windows systems? I 
have plenty of Linux boxes here, but I'd have to build and install 
apache in a "private" corner just to run the tests.

Anyway, I see where the problem is, in tests.py it does:

def util_fieldstorage(req):
     from mod_python import util
     req.write(`util.FieldStorage(req).list`)
     return apache.OK

So I'll just add a __repr__ function to the StringField class and the 
test should pass then.

I have attached a patch (on the 3.2.5b version of util.py) to include 
this change. That should fix the unit test, and it did not break any of 
my own code.

Jim Gallacher wrote:
> Hi Mike,
> 
> I don't have time to dig into this tonight, but your patch causes one of 
> the unit tests to fail.
> 
> FAIL: test_util_fieldstorage (__main__.PerRequestTestCase)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test.py", line 1117, in test_util_fieldstorage
>     self.fail(`rsp`)
>   File "/usr/lib/python2.3/unittest.py", line 270, in fail
>     raise self.failureException, msg
> AssertionError: "['1', '2', '3', '4']"
> 
> 
> Jim
> 

-- 
Mike Looijmans
Philips Natlab / Topic Automation

Re: More efficient StringField and Field classes

Posted by Jim Gallacher <jp...@jgassociates.ca>.

Hi Mike,

I don't have time to dig into this tonight, but your patch causes one of 
the unit tests to fail.

FAIL: test_util_fieldstorage (__main__.PerRequestTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
   File "test.py", line 1117, in test_util_fieldstorage
     self.fail(`rsp`)
   File "/usr/lib/python2.3/unittest.py", line 270, in fail
     raise self.failureException, msg
AssertionError: "['1', '2', '3', '4']"


Jim

Re: More efficient StringField and Field classes

Posted by Mike Looijmans <nl...@natlab.research.philips.com>.

It has been awhile, but I finally made the patch.
Advice and feedback is welcome.

What it does:
- Simplifies the creation of StringField objects. This was already
marked as a TODO in the 3.2.5b code.
- Does not create a file object (cStringIO) for each and every field.
- FieldStorage.get() will always return the same instance(s) for any
given name.
- FieldStorage.get() is very cheap now (does not create new objects,
which is expensive in Python)

It may break:
- Code that creates an instance of Field or StringField directly. This
can be fixed by extending the __init__() routine, but I don't think it's
needed.

It works:
- Perfect on all of my projects.

Attached:
- Patch created by SVN

-- 
Mike Looijmans
Philips Natlab / Topic Automation

Re: Developer Guidelines (was Re: More efficient FieldStorage, StringField and Field classes)

Posted by "Gregory (Grisha) Trubetskoy" <gr...@apache.org>.

On Tue, 15 Nov 2005, Mike Looijmans wrote:

> Then I get:
>
> error: Python was built with version 7.1 of Visual Studio, and extensions 
> need to be built with the same version of the compiler, but it isn't 
> installed.
>
> I don't have a VS license - that would mean I have to build my own Python in 
> order to get any further. Oh dear...

I seem to remember long ago that if you find the code in distutils that 
produces this message and comment out that check, it still works even with 
incompatible VS's.

Grisha

Re: Developer Guidelines (was Re: More efficient FieldStorage, StringField and Field classes)

Posted by David Fraser <da...@sjsoft.com>.

Mike Looijmans wrote:

> David Fraser wrote:
>
> > See my other mail - basically you do
> > cd dist
> > set APACHESRC=c:\path\to\apache  (where this is the Apache2 dir in a
> > standard installation, now need for apache source)
> > build_installer
>
> Okay, first I downloaded the .NET SDK. (Our internet connection is 
> great here)
>
> Then I get:
>
> error: Python was built with version 7.1 of Visual Studio, and 
> extensions need to be built with the same version of the compiler, but 
> it isn't installed.
>
> I don't have a VS license - that would mean I have to build my own 
> Python in order to get any further. Oh dear...

Strange, I have VS.NET 2003 installed here, and it doesn't give errors.
I think I got mod_python compiling at one stage with mingw so that is 
the other route you could go...

> I was wondering - since i won't be hackin' the C code (yet), would it 
> be sufficient to just install the binary, and use the .py files as 
> obtained from the SVN repository?
>
Yes you should be able to install the binary and run the tests from the 
repository (if you can get past the issues I had running them ...)

> The 3.2.5 beta seems to run just fine here. Cannot run the automated 
> tests though, because of the issues mentioned above.

Cheers
David

Re: Developer Guidelines (was Re: More efficient FieldStorage, StringField and Field classes)

Posted by Mike Looijmans <nl...@natlab.research.philips.com>.

David Fraser wrote:

 > See my other mail - basically you do
 > cd dist
 > set APACHESRC=c:\path\to\apache  (where this is the Apache2 dir in a
 > standard installation, now need for apache source)
 > build_installer

Okay, first I downloaded the .NET SDK. (Our internet connection is great 
here)

Then I get:

error: Python was built with version 7.1 of Visual Studio, and 
extensions need to be built with the same version of the compiler, but 
it isn't installed.

I don't have a VS license - that would mean I have to build my own 
Python in order to get any further. Oh dear...

I was wondering - since i won't be hackin' the C code (yet), would it be 
sufficient to just install the binary, and use the .py files as obtained 
from the SVN repository?

The 3.2.5 beta seems to run just fine here. Cannot run the automated 
tests though, because of the issues mentioned above.

-- 
Mike Looijmans
Philips Natlab / Topic Automation

Re: Developer Guidelines (was Re: More efficient FieldStorage, StringField and Field classes)

Posted by David Fraser <da...@sjsoft.com>.

Mike Looijmans wrote:

> Ok, following Jim's advice, an having read his guide, I have so far:
> - Fetched the trunk version with anonymous SVN (no problem)
>
> I've already installed apache 2.0.55 and Python 2.4.1 on this Windows 
> machine (the lazy way, with MSI packages).
>
> I've also installed the latest Cygwin (all "default" packages).
>
> However, I'm not getting anywhere in compiling mod_python. It requires 
> the apxs tool which is apparently not part of any Apache windows 
> distribution. So I grabbed the apache 2.0.55 source for windows, but 
> the configure script is now stuck for over half an hour on:
>
> Configuring Apache Portable Runtime Utility library...
>
> checking for APR-util... reconfig
> configuring package in srclib/apr-util now
>
> Nothing appears to happen.
>
> Anyone here who can help me out getting this up and running on Windows 
> XP?

See my other mail - basically you do
cd dist
set APACHESRC=c:\path\to\apache  (where this is the Apache2 dir in a 
standard installation, now need for apache source)
build_installer

Unfortunately I haven't got the tests working yet, but this produces an 
installer in dist/dist/

I wonder if it would be a good idea to have a wiki for mod_python, it 
can be a nice way of collecting documentation...

David

>
> I can also use linux machines for compile and tests (I'm not root on 
> any), but I do most development on this Windows system, so I need a 
> windows executable anyway.
>
> PS: Haven't found any speling errors in your document...
>
> Jim Gallacher wrote:
>
>> Mike Looijmans wrote:
>>
>>> Inspired by our FieldStorage work with file uploads, I also made 
>>> some implementation changes to Field and StringField in util.py. In 
>>> essense, that will make things like using form['key'] multiple times 
>>> much more efficient. Also, FieldStorage will return the same object 
>>> if called twice.
>>>
>>> I'd like some feedback from the group on these changes, and also 
>>> some directions on what I'd need to do to get them integrated in the 
>>> mod_python beta. I have no experience whatsoever in contributing to 
>>> open source projects.
>>
>>
>>
>> I think it's too late in the beta cycle to be rewriting code unless 
>> it's to fix a specific, critical bug. We really, really, *really* 
>> need to deliver 3.2 final, and I don't want to mess with working code 
>> at this point. The changes you are suggesting are likely more 
>> appropriate for 3.3, or possibly a 3.2 bugfix release.
>>
>> As far as contributing to mod_python, well, I think you already have 
>> the right idea: propose changes; discuss on python-dev; seek 
>> consensus; wait for your changes to be committed by myself or 
>> Nicolas. Bug fixes and code improvements are much more likely to be 
>> accepted than major new features. We want mod_python to be tightly 
>> focused on the core functionality of providing a python interface to 
>> apache. The fancy bells and whistles are best left to frameworks that 
>> utilize mod_python further up the application stack.
>>
>> For your specific changes, I would strongly suggest that you make your
>> changes against the mod_python development branch, ie trunk. Also people
>> are much more likely to test your changes if you submit a patch rather
>> than offering a completely new file. This makes it easier to see what 
>> code is being changed and avoids problems with regressions.
>>
>> I've put together a "Developer's Guide" draft that I'd like to 
>> eventually see included in either the documentation or on the 
>> website. This document doesn't represent any offical position of the 
>> mod_python developers, but rather just what I see as best practices 
>> and some resources for contributors. Everyone should feel free to 
>> offer suggestions to improve this guide. The original is in svn at:
>> http://svn.apache.org/repos/asf/httpd/mod_python/branches/jgallacher/docs/developers-guide.rst 
>>
>>
>> And no, I have not spell checked it yet. :)
>
>

Re: Developer Guidelines (was Re: More efficient FieldStorage, StringField and Field classes)

Posted by Mike Looijmans <nl...@natlab.research.philips.com>.

Ok, following Jim's advice, an having read his guide, I have so far:
- Fetched the trunk version with anonymous SVN (no problem)

I've already installed apache 2.0.55 and Python 2.4.1 on this Windows 
machine (the lazy way, with MSI packages).

I've also installed the latest Cygwin (all "default" packages).

However, I'm not getting anywhere in compiling mod_python. It requires 
the apxs tool which is apparently not part of any Apache windows 
distribution. So I grabbed the apache 2.0.55 source for windows, but the 
configure script is now stuck for over half an hour on:

Configuring Apache Portable Runtime Utility library...

checking for APR-util... reconfig
configuring package in srclib/apr-util now

Nothing appears to happen.

Anyone here who can help me out getting this up and running on Windows XP?

I can also use linux machines for compile and tests (I'm not root on 
any), but I do most development on this Windows system, so I need a 
windows executable anyway.

PS: Haven't found any speling errors in your document...

Jim Gallacher wrote:
> Mike Looijmans wrote:
> 
>> Inspired by our FieldStorage work with file uploads, I also made some 
>> implementation changes to Field and StringField in util.py. In 
>> essense, that will make things like using form['key'] multiple times 
>> much more efficient. Also, FieldStorage will return the same object if 
>> called twice.
>>
>> I'd like some feedback from the group on these changes, and also some 
>> directions on what I'd need to do to get them integrated in the 
>> mod_python beta. I have no experience whatsoever in contributing to 
>> open source projects.
> 
> 
> I think it's too late in the beta cycle to be rewriting code unless it's 
> to fix a specific, critical bug. We really, really, *really* need to 
> deliver 3.2 final, and I don't want to mess with working code at this 
> point. The changes you are suggesting are likely more appropriate for 
> 3.3, or possibly a 3.2 bugfix release.
> 
> As far as contributing to mod_python, well, I think you already have the 
> right idea: propose changes; discuss on python-dev; seek consensus; wait 
> for your changes to be committed by myself or Nicolas. Bug fixes and 
> code improvements are much more likely to be accepted than major new 
> features. We want mod_python to be tightly focused on the core 
> functionality of providing a python interface to apache. The fancy bells 
> and whistles are best left to frameworks that utilize mod_python further 
> up the application stack.
> 
> For your specific changes, I would strongly suggest that you make your
> changes against the mod_python development branch, ie trunk. Also people
> are much more likely to test your changes if you submit a patch rather
> than offering a completely new file. This makes it easier to see what 
> code is being changed and avoids problems with regressions.
> 
> I've put together a "Developer's Guide" draft that I'd like to 
> eventually see included in either the documentation or on the website. 
> This document doesn't represent any offical position of the mod_python 
> developers, but rather just what I see as best practices and some 
> resources for contributors. Everyone should feel free to offer 
> suggestions to improve this guide. The original is in svn at:
> http://svn.apache.org/repos/asf/httpd/mod_python/branches/jgallacher/docs/developers-guide.rst 
> 
> 
> And no, I have not spell checked it yet. :)

-- 
Mike Looijmans
Philips Natlab / Topic Automation

Developer Guidelines (was Re: More efficient FieldStorage, StringField and Field classes)

Posted by Jim Gallacher <jp...@jgassociates.ca>.

Mike Looijmans wrote:
> Inspired by our FieldStorage work with file uploads, I also made some 
> implementation changes to Field and StringField in util.py. In essense, 
> that will make things like using form['key'] multiple times much more 
> efficient. Also, FieldStorage will return the same object if called twice.
> 
> I'd like some feedback from the group on these changes, and also some 
> directions on what I'd need to do to get them integrated in the 
> mod_python beta. I have no experience whatsoever in contributing to open 
> source projects.

I think it's too late in the beta cycle to be rewriting code unless it's 
to fix a specific, critical bug. We really, really, *really* need to 
deliver 3.2 final, and I don't want to mess with working code at this 
point. The changes you are suggesting are likely more appropriate for 
3.3, or possibly a 3.2 bugfix release.

As far as contributing to mod_python, well, I think you already have the 
right idea: propose changes; discuss on python-dev; seek consensus; wait 
for your changes to be committed by myself or Nicolas. Bug fixes and 
code improvements are much more likely to be accepted than major new 
features. We want mod_python to be tightly focused on the core 
functionality of providing a python interface to apache. The fancy bells 
and whistles are best left to frameworks that utilize mod_python further 
up the application stack.

For your specific changes, I would strongly suggest that you make your
changes against the mod_python development branch, ie trunk. Also people
are much more likely to test your changes if you submit a patch rather
than offering a completely new file. This makes it easier to see what 
code is being changed and avoids problems with regressions.

I've put together a "Developer's Guide" draft that I'd like to 
eventually see included in either the documentation or on the website. 
This document doesn't represent any offical position of the mod_python 
developers, but rather just what I see as best practices and some 
resources for contributors. Everyone should feel free to offer 
suggestions to improve this guide. The original is in svn at:
http://svn.apache.org/repos/asf/httpd/mod_python/branches/jgallacher/docs/developers-guide.rst

And no, I have not spell checked it yet. :)

Regards,
Jim

More efficient FieldStorage, StringField and Field classes

Posted by Mike Looijmans <nl...@natlab.research.philips.com>.

Inspired by our FieldStorage work with file uploads, I also made some 
implementation changes to Field and StringField in util.py. In essense, 
that will make things like using form['key'] multiple times much more 
efficient. Also, FieldStorage will return the same object if called twice.

I'd like some feedback from the group on these changes, and also some 
directions on what I'd need to do to get them integrated in the 
mod_python beta. I have no experience whatsoever in contributing to open 
source projects.

Major change is the Field and FieldStorage implementation:

class Field:

     filename = None
     headers = {}

     def __init__(self, name):
         self.name = name

     def __repr__(self):
         """Return printable representation."""
         return "Field(%s, %s)" % (`self.name`, `self.value`)

     def __getattr__(self, name):
         if name != 'value':
             raise AttributeError, name
         if self.file:
             self.file.seek(0)
             self.value = self.file.read()
             self.file.seek(0)
         else:
             self.value = None
         return self.value

     def __del__(self):
         self.file.close()

class StringField(str):
     """ This class is basically a string with
     added attributes for compatibility with std lib cgi.py. Basically, this
     works the opposite of Field, as it stores its data in a string, but 
creates
     a file on demand. Field creates a value on demand and stores data 
in a file.
     """
     filename = None
     headers = {}
     ctype = "text/plain"
     type_options = {}
     disposition = None
     disp_options = None

     # I wanted __init__(name, value) but that does not work 
(apparently, you
     # cannot subclass str with a constructor that takes >1 argument)
     def __init__(self, value):
         '''Create StringField instance. You'll have to set name 
yourself.'''
         str.__init__(self, value)
         self.value = value

     def __getattr__(self, name):
         if name != 'file':
             raise AttributeError, name
         self.file = cStringIO.StringIO(self.value)
         return self.file

I've attached a working util.py (based on 3.1.4, so there are also the 
changes to allow large uploads in the file).

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Alexis Marrero <am...@mitre.org>.

Inspired by Mike's changes I made some changes the "new" version to  
improve performance while keeping readability:

def read_to_boundary_new(self, req, boundary, file, readBlockSize):
     previous_delimiter = ''
     bound_length = len(boundary)
     while 1:
         line = req.readline(readBlockSize)
         if line[:bound_length] == boundary:
             break

         if line[-2:] == '\r\n':
             file.write(previous_delimiter + line[:-2])
             previous_delimiter = '\r\n'

         elif line[-1:] == '\r':
             file.write(previous_delimiter + line[:-1])
             previous_delimiter = '\r'

         elif line[-1:] == '\n':
             if len(line[:-1]) > 0:
                 file.write(previous_delimiter + line[:-1])
                 previous_delimiter = '\n'

             else:
                 previous_delimiter += '\n'

         else:
             file.write(previous_delimiter + line)
             previous_delimiter = ''

Results are very comparable to read_to_boundary_mike

/amn
On Nov 8, 2005, at 9:07 AM, Jim Gallacher wrote:

> Mike Looijmans wrote:
>
>> I've attached a modified upload_test_harness.py that includes the  
>> new and current, also the 'org' version (as in 3.1 release) and  
>> the 'mike' version.
>>
>
> Nice changes, Mike.
>
> I started to get confused by the names of the various  
> read_to_boundary_* functions, so I've made a slight modification to  
> your code and renamed these functions.
>
> Note also that there are some failures for the split_boundary file  
> using offset -2 chunk [CR], so I've change the default offset to  
> include this condition.
>
> I've also changed the generate_*file functions to embed an  
> additional '\r' in a short string. Just being paranoid I guess.
>
> Diff is attached.
>
> If it's useful I can stick the upload_test_harness code in the  
> mod_python svn repository.
>
> Jim
>
> --- upload_test_harness_mike.py.orig    2005-11-08  
> 08:02:13.000000000 -0500
> +++ upload_test_harness_mike.py    2005-11-08 08:54:15.000000000 -0500
> @@ -14,8 +14,8 @@
>  ##    f.write('b'*block_size)
>  ##    f.close()
>
> -def read_to_boundary_current(self, req, boundary, file,  
> readBlockSize):
> -    ''' currrent version '''
> +def read_to_boundary_324(self, req, boundary, file, readBlockSize):
> +    ''' currrent version  in 3.2.4b'''
>      #
>      # Although technically possible for the boundary to be split  
> by the read, this will
>      # not happen because the readBlockSize is set quite high - far  
> longer than any boundary line
> @@ -66,7 +66,7 @@
>          else:
>              sline = ''
>
> -def read_to_boundary_new(self, req, boundary, file, readBlockSize):
> +def read_to_boundary_alexis(self, req, boundary, file,  
> readBlockSize):
>      ''' Alexis' version
>          read from the request object line by line with a maximum  
> size,
>          until the new line starts with boundary
> @@ -89,7 +89,7 @@
>              file.write(previous_delimiter + line)
>              previous_delimiter = ''
>
> -def read_to_boundary_org(self, req, boundary, file, readBlockSize):
> +def read_to_boundary_314(self, req, boundary, file, readBlockSize):
>      delim = ""
>      line = req.readline(readBlockSize)
>      while line and not line.startswith(boundary):
> @@ -145,6 +145,8 @@
>
>      f = open(fname, 'wb')
>      f.write('a'*50)
> +    f.write('\r')
> +    f.write('a'*50)
>      f.write('\r\n')
>
>      block_size =  readBlockSize + offset
> @@ -163,6 +165,8 @@
>      """
>      f = open(fname, 'wb')
>      f.write('a'*50)
> +    f.write('\r')
> +    f.write('a'*50)
>      f.write('\r\n')
>
>      block_size =  readBlockSize + offset
> @@ -171,7 +175,7 @@
>
>      f.close()
>
> -read_boundaries = [read_to_boundary_current, read_to_boundary_new,  
> read_to_boundary_org, read_to_boundary_mike]
> +read_boundaries = [read_to_boundary_324, read_to_boundary_alexis,  
> read_to_boundary_314, read_to_boundary_mike]
>
>  def main(file_generator, offset, chunk, block_size=1<<16):
>      fname_in = 'testfile.in'
> @@ -218,6 +222,9 @@
>          pass
>
>  if __name__ == '__main__':
> +    # offsets which show problems are in [-2, -1]
> +    first_offset = -2
> +    last_offset = 0
>
>      #test_chunks =  ['', '\r', '\n', '\r\n']
>
> @@ -229,7 +236,7 @@
>          print '='*40
>          print file_gen_obj.__name__
>          for chunk in test_chunks:
> -            for i in range(-1, 0):
> +            for i in range(first_offset, last_offset):
>                  print '-'*40
>                  print 'test offset', i, 'chunk',[ cname(c) for c  
> in chunk ]
>                  print '-'*40
>

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Jim Gallacher <jp...@jgassociates.ca>.

Mike Looijmans wrote:
> I've attached a modified upload_test_harness.py that includes the new 
> and current, also the 'org' version (as in 3.1 release) and the 'mike' 
> version.

Nice changes, Mike.

I started to get confused by the names of the various read_to_boundary_* 
functions, so I've made a slight modification to your code and renamed 
these functions.

Note also that there are some failures for the split_boundary file using 
offset -2 chunk [CR], so I've change the default offset to include this 
condition.

I've also changed the generate_*file functions to embed an additional 
'\r' in a short string. Just being paranoid I guess.

Diff is attached.

If it's useful I can stick the upload_test_harness code in the 
mod_python svn repository.

Jim

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Mike Looijmans <nl...@natlab.research.philips.com>.

Alexis Marrero wrote:
> The next test that I will run this against will be with an obscene  
> amount of data for which this improvement helps a lot!

The dumb thing is the checking for boundaries.

I'm using http "chunked" encoding to access a raw TAPE device through 
HTTP with python (it GETs or POSTs the raw data as body, each chunk 
coresponds to a tape block). It blazes the data through at the max 
network speed with hardly any CPU usage. This HTTP upload code uses 100% 
CPU while running on my 3GHz box.

The looking-for-line-ends and mime boundaries method is very inefficient 
compared to that. They oughta have put a "content-length" into every 
chunk header, and we wouldn't have had this problem in the first place.

I think the only realistic way to improve performance is to read the 
client input in binary chunks, and then looking for '\r\n---boundary' 
strings in the chunk using standard string functions. Most of the CPU 
time is now spent in the readline() call.

This also means revising all the mime body parsing to cope with that... 
I doubt if that will be worth the effort for anyone.

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Alexis Marrero <am...@mitre.org>.

Thanks for that improvement, don't like its complexity though.  I'm  
testing "mikes" version with my set of files I will all let you know  
how it goes.

BTW, the line that reads "last_bound = boundary + '--'" so we save 4  
CPU cycles there :)

The next test that I will run this against will be with an obscene  
amount of data for which this improvement helps a lot!

/amn
On Nov 8, 2005, at 4:47 AM, Mike Looijmans wrote:

> Here's one that passes all the tests, and is 2x as fast as the  
> 'current' and 'new' implementations on random binary data. I  
> haven't been able to generate data where the 'mike' version is slower:
>
> def read_to_boundary(self, req, boundary, file, readBlockSize=65536):
>     prevline = ""
>     last_bound = boundary + '--'
>     carry = None
>     while 1:
>         line = req.readline(readBlockSize)
>         if not line or line.startswith(boundary):
>             if prevline.endswith('\r\n'):
>                 if carry is not None:
>                     file.write(carry)
>                 file.write(prevline[:-2])
>                 break
>             elif (carry == '\r') and (prevline[-1] == '\n'):
>                 file.write(prevline[:-1])
>                 break
>             # If we get here, it's not really a boundary!
>         if carry is not None:
>             file.write(carry)
>             carry = None
>         if prevline[-1:] == '\r':
>             file.write(prevline[:-1])
>             carry = '\r'
>         else:
>             file.write(prevline)
>         prevline = line
>
> I've attached a modified upload_test_harness.py that includes the  
> new and current, also the 'org' version (as in 3.1 release) and the  
> 'mike' version.
>
> In addition, I added some profiling calls to show the impact of the  
> extra 'endswith' and slices.
>
> -- 
> Mike Looijmans
> Philips Natlab / Topic Automation
> #!/usr/bin/env python
>
> import sys
> from cStringIO import StringIO
> import md5
>
> ##def generate_split_file(offset=-1,
> ##                          readBlockSize=65368,
> ##                          fname='testfile'):
> ##    f = open(fname, 'wb')
> ##    f.write('a'*50)
> ##    f.write('\r\n')
> ##    block_size =  readBlockSize + offset
> ##    f.write('b'*block_size)
> ##    f.close()
>
> def read_to_boundary_current(self, req, boundary, file,  
> readBlockSize):
>     ''' currrent version '''
>     #
>     # Although technically possible for the boundary to be split by  
> the read, this will
>     # not happen because the readBlockSize is set quite high - far  
> longer than any boundary line
>     # will ever contain.
>     #
>     # lastCharCarried is used to detect the situation where the \r 
> \n is split across the end of
>     # a read block.
>     #
>     delim = ''
>     lastCharCarried = False
>     last_bound = boundary + '--'
>     roughBoundaryLength = len(last_bound) + 128
>     line = req.readline(readBlockSize)
>     lineLength = len(line)
>     if lineLength < roughBoundaryLength:
>         sline = line.strip()
>     else:
>         sline = ''
>     while lineLength > 0 and sline != boundary and sline !=  
> last_bound:
>         if not lastCharCarried:
>             file.write(delim)
>             delim = ''
>         else:
>             lastCharCarried = False
>         cutLength = 0
>
>         if lineLength == readBlockSize:
>             if line[-1:] == '\r':
>                 delim = '\r'
>                 cutLength = -1
>                 lastCharCarried = True
>
>         if line[-2:] == '\r\n':
>             delim += '\r\n'
>             cutLength = -2
>         elif line[-1:] == '\n':
>             delim += '\n'
>             cutLength = -1
>         if cutLength != 0:
>             file.write(line[:cutLength])
>         else:
>             file.write(line)
>
>         line = req.readline(readBlockSize)
>         lineLength = len(line)
>         if lineLength < roughBoundaryLength:
>             sline = line.strip()
>         else:
>             sline = ''
>
> def read_to_boundary_new(self, req, boundary, file, readBlockSize):
>     ''' Alexis' version
>         read from the request object line by line with a maximum size,
>         until the new line starts with boundary
>     '''
>     previous_delimiter = ''
>     while 1:
>         line = req.readline(readBlockSize)
>         if line.startswith(boundary):
>             break
>
>         if line.endswith('\r\n'):
>             file.write(previous_delimiter + line[:-2])
>             previous_delimiter = '\r\n'
>
>         elif line.endswith('\r') or line.endswith('\n'):
>             file.write(previous_delimiter + line[:-1])
>             previous_delimiter = line[-1:]
>
>         else:
>             file.write(previous_delimiter + line)
>             previous_delimiter = ''
>
> def read_to_boundary_org(self, req, boundary, file, readBlockSize):
>     delim = ""
>     line = req.readline(readBlockSize)
>     while line and not line.startswith(boundary):
>         odelim = delim
>         if line[-2:] == "\r\n":
>             delim = "\r\n"
>             line = line[:-2]
>         elif line[-1:] == "\n":
>             delim = "\n"
>             line = line[:-1]
>         else:
>             delim = ""
>         file.write(odelim + line)
>         line = req.readline(readBlockSize)
>
> def read_to_boundary_mike(self, req, boundary, file,  
> readBlockSize=65536):
>     prevline = ""
>     last_bound = boundary + '--'
>     carry = None
>     while 1:
>         line = req.readline(readBlockSize)
>         if not line or line.startswith(boundary):
>             if prevline.endswith('\r\n'):
>                 if carry is not None:
>                     file.write(carry)
>                 file.write(prevline[:-2])
>                 break
>             elif (carry == '\r') and (prevline[-1] == '\n'):
>                 file.write(prevline[:-1])
>                 break
>             # If we get here, it's not really a boundary!
>         if carry is not None:
>             file.write(carry)
>             carry = None
>         if prevline[-1:] == '\r':
>             file.write(prevline[:-1])
>             carry = '\r'
>         else:
>             file.write(prevline)
>         prevline = line
>
> def get_checksum(fname):
>     data = open(fname, 'rb').read()
>     m = md5.new()
>     m.update(data)
>     return m.hexdigest()
>
> def generate_embedded_cr_file(offset=-1, readBlockSize=65368,  
> chunk='\r', fname='testfile'):
>     """ Generate a file which causes the error with file upload
>         The default offset of -1 should generate a file which will
>         be corrupted by the file upload.
>     """
>
>     f = open(fname, 'wb')
>     f.write('a'*50)
>     f.write('\r\n')
>
>     block_size =  readBlockSize + offset
>     f.write('b'*block_size)
>     f.write(chunk)
>     f.write('ccc')
>
>     f.write('d'*50)
>     f.write('\r\n')
>
>     f.close()
>
> def generate_split_boundary_file(offset=-1, readBlockSize=65368,  
> chunk='\r', fname='testfile'):
>     """ this function generates a file with a boundary string '\r 
> \n--myboundary'
>         starting at readBlockSize - offset
>     """
>     f = open(fname, 'wb')
>     f.write('a'*50)
>     f.write('\r\n')
>
>     block_size =  readBlockSize + offset
>     f.write('b'*block_size)
>     f.write(chunk)
>
>     f.close()
>
> read_boundaries = [read_to_boundary_current, read_to_boundary_new,  
> read_to_boundary_org, read_to_boundary_mike]
>
> def main(file_generator, offset, chunk, block_size=1<<16):
>     fname_in = 'testfile.in'
>     fname_out_base = 'testfile.out'
>     file_generator(offset=offset, readBlockSize=block_size,  
> chunk=chunk, fname=fname_in)
>
>     orig_checksum = get_checksum(fname_in)
>
>     req = StringIO()
>     req.write(open(fname_in, 'rb').read())
>     req.write('\r\n--myboundary\r\n')
>
>     src_cs = get_checksum(fname_in)
>     print '     src', src_cs
>
>     for rtb in read_boundaries:
>         name = rtb.__name__.split('_')[-1]
>         fname_out = fname_out_base + name
>         o = file(fname_out, 'wb')
>         req.seek(0)
>         rtb(None, req, '--myboundary', o, block_size)
>         size = o.tell()
>         o.close()
>         cs = get_checksum(fname_out)
>         print "%8s %s %6d" % (name, cs, size),
>         if cs != src_cs:
>             print 'FAIL'
>         else:
>             print 'PASS'
>
>
> def cname(ch):
>     if ch == '\r':
>         return 'CR'
>     elif ch == '\n':
>         return 'LF'
>     elif ch == '':
>         return 'None'
>     else:
>         return ord(ch)
>
> class DevNull:
>     def write(self, data):
>         pass
>
> if __name__ == '__main__':
>
>     #test_chunks =  ['', '\r', '\n', '\r\n']
>
>     # only test the chunks that are currently a problem
>     test_chunks =  ['', '\r',]
>
>     test_cases = (generate_embedded_cr_file,  
> generate_split_boundary_file)
>     for file_gen_obj in test_cases:
>         print '='*40
>         print file_gen_obj.__name__
>         for chunk in test_chunks:
>             for i in range(-1, 0):
>                 print '-'*40
>                 print 'test offset', i, 'chunk',[ cname(c) for c in  
> chunk ]
>                 print '-'*40
>                 main(file_gen_obj, i, chunk)
>                 print
>             print
> ##    sys.exit(0)
>
>     # Run profiling benchmarks with the four files
>     # (on my system, the _mike and _org versions are typically almost
>     # twice as fast as the _new and _current)
>     import profile, random
>     req = StringIO()
>     blocksize = 102400
>     block=[]
>     r = random.Random()
>     for i in xrange(blocksize):
>         block.append(chr(r.randint(0,255)))
>     block = "".join(block)
>     for c in range(64):
>       req.write(block)
>     req.write('\r\n--myboundary\r\n')
>     print "Benchmark bytes:", req.tell()
>     out = DevNull()
>     for rtb in read_boundaries:
>         req.seek(0)
>         print rtb.__name__
>         profile.run("%s(None, req, '--myboundary', out, 65536)" %  
> rtb.__name__)
>

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Mike Looijmans <nl...@natlab.research.philips.com>.

Here's one that passes all the tests, and is 2x as fast as the 'current' 
and 'new' implementations on random binary data. I haven't been able to 
generate data where the 'mike' version is slower:

def read_to_boundary(self, req, boundary, file, readBlockSize=65536):
     prevline = ""
     last_bound = boundary + '--'
     carry = None
     while 1:
         line = req.readline(readBlockSize)
         if not line or line.startswith(boundary):
             if prevline.endswith('\r\n'):
                 if carry is not None:
                     file.write(carry)
                 file.write(prevline[:-2])
                 break
             elif (carry == '\r') and (prevline[-1] == '\n'):
                 file.write(prevline[:-1])
                 break
             # If we get here, it's not really a boundary!
         if carry is not None:
             file.write(carry)
             carry = None
         if prevline[-1:] == '\r':
             file.write(prevline[:-1])
             carry = '\r'
         else:
             file.write(prevline)
         prevline = line

I've attached a modified upload_test_harness.py that includes the new 
and current, also the 'org' version (as in 3.1 release) and the 'mike' 
version.

In addition, I added some profiling calls to show the impact of the 
extra 'endswith' and slices.

-- 
Mike Looijmans
Philips Natlab / Topic Automation

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Mike Looijmans <nl...@natlab.research.philips.com>.

What i don't like at all in this implementation is the large amount of 
memcpy operations.
1. line.strip()
2. line[:-x]
3. previous_delimiter + ...
The average pass will perform between two and three memcopy operations 
on the read block.

Suggestion: Loose the strip() call - it serves no purpose (the boundary 
will be the only thing on that line, otherwise it's not a boundary)

I've built one without any memcpy whatsoever, I've tried it with the 
test harnass and it appears equal to the Alexis version (pass all except 
the generate_split_boundary_file test):

def read_to_boundary(self, req, boundary, file, readBlockSize=65536):
     prevline = ""
     while 1:
         line = req.readline(readBlockSize)
         if not line or line.startswith(boundary):
             if prevline.endswith('\r\n'):
                 file.write(prevline[:-2])
             elif prevline.endswith('\n'):
                 file.write(prevline[:-1])
             break
         file.write(prevline)
         prevline = line






Alexis Marrero wrote:
 > New version of read_to_boundary(...)
 >
 > readBlockSize = 1 << 16
 > def read_to_boundary(self, req, boundary, file):
 >     previous_delimiter = ''
 >     while 1:
 >         line = req.readline(readBlockSize)
 >         if line.strip().startswith(boundary):
 >             break
 >
 >         if line.endswith('\r\n'):
 >             file.write(previous_delimiter + line[:-2])
 >             previous_delimiter = '\r\n'
 >
 >         elif line.endswith('\r'):
 >             file.write(previous_delimiter + line[:-1])
 >             previous_delimiter = '\r'
 >
 >         elif line.endswith('\n'):
 >             if len(line[:-1]) > 0:
 >                 file.write(previous_delimiter + line[:-1])
 >                 previous_delimiter = '\n'
 >
 >             else:
 >                 previous_delimiter += '\n'
 >
 >         else:
 >             file.write(previous_delimiter + line)
 >             previous_delimiter = ''
 >
 > This new functions passes the test for Jim's filetest generator.


-- 
Mike Looijmans
Philips Natlab / Topic Automation

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Alexis Marrero <am...@mitre.org>.

New version of read_to_boundary(...)

readBlockSize = 1 << 16
def read_to_boundary(self, req, boundary, file):
     previous_delimiter = ''
     while 1:
         line = req.readline(readBlockSize)
         if line.strip().startswith(boundary):
             break

         if line.endswith('\r\n'):
             file.write(previous_delimiter + line[:-2])
             previous_delimiter = '\r\n'

         elif line.endswith('\r'):
             file.write(previous_delimiter + line[:-1])
             previous_delimiter = '\r'

         elif line.endswith('\n'):
             if len(line[:-1]) > 0:
                 file.write(previous_delimiter + line[:-1])
                 previous_delimiter = '\n'

             else:
                 previous_delimiter += '\n'

         else:
             file.write(previous_delimiter + line)
             previous_delimiter = ''

This new functions passes the test for Jim's filetest generator.

[core:~] amarrero% python filegenerator.py ; md5 testfile ; cp  
testfile testfile.new ; python test.py ; md5 test.bin
MD5 (testfile) = d481990a0f0bbd8acf847cd732714555
MD5 (outputtestfile.bin) = d481990a0f0bbd8acf847cd732714555

I'm running a test with my set of files (60000+) to see any other new  
issues.



On Nov 7, 2005, at 6:35 PM, Jim Gallacher wrote:

> Alexis Marrero wrote:
>
>> Ok. Now I'm confused.
>>
>
> So am I!
>
> I've created a test harness so we can bypass mod_python completely.  
> It includes a slightly modified version of read_to_boundary which  
> adds a new parameter, readBlockSize.
>
> In the output from the test harness, your version is 'new' and the  
> current version is 'cur'. Run it and see what happens.
>
> Jim
>
> $ ./upload_test_harness
>
> ========================================
> generate_embedded_cr_file
> ----------------------------------------
> test offset -1 chunk []
> ----------------------------------------
> src 5a63347d1106afdfa264b2a61f81ae82
> cur 5a63347d1106afdfa264b2a61f81ae82 PASS
> new 5a63347d1106afdfa264b2a61f81ae82 PASS
>
> ----------------------------------------
> test offset -1 chunk ['CR']
> ----------------------------------------
> src 82204e52343d5b25c2e783cd59499973
> cur e4af2eee73029642a114697ba59217b3 FAIL
> new 82204e52343d5b25c2e783cd59499973 PASS
>
> ========================================
> generate_split_boundary_file
> ----------------------------------------
> test offset -1 chunk []
> ----------------------------------------
> src d481990a0f0bbd8acf847cd732714555
> cur d481990a0f0bbd8acf847cd732714555 PASS
> new 8fa5ac9f913d778575ea871506c392a9 FAIL
>
> ----------------------------------------
> test offset -1 chunk ['CR']
> ----------------------------------------
> src 8fa5ac9f913d778575ea871506c392a9
> cur d481990a0f0bbd8acf847cd732714555 FAIL
> new 8fa5ac9f913d778575ea871506c392a9 PASS
>
>
>
>> What I was trying to say is that I created a file with this function:
>> def generate_split_file(offset=-1,
>>                           readBlockSize=65368,
>>                           fname='testfile'):
>>     f = open(fname, 'w')
>>     f.write('a'*50)
>>     f.write('\r\n')
>>     block_size =  readBlockSize + offset
>>     f.write('b'*block_size)
>>     f.close()
>> Then I uploaded 'testfile' using the following   
>> StorageField.read_to_boundary() method:
>> def read_to_boundary(self, req, boundary, file):
>>     ''' read from the request object line by line with a maximum  
>> size,
>>         until the new line starts with boundary
>>     '''
>>     previous_delimiter = ''
>>     while 1:
>>         line = req.readline(1<<16)
>>         if line.startswith(boundary):
>>             break
>>         if line.endswith('\r\n'):
>>             file.write(previous_delimiter + line[:-2])
>>             previous_delimiter = '\r\n'
>>         elif line.endswith('\r') or line.endswith('\n'):
>>             file.write(previous_delimiter + line[:-1])
>>             previous_delimiter = line[-1:]
>>         else:
>>             file.write(previous_delimiter + line)
>>             previous_delimiter = ''
>> And the md5 in the client is the same one as in the server.  Do  
>> you  have different results? Let me know.
>> Regards,
>> /amn
>> On Nov 7, 2005, at 2:11 PM, Jim Gallacher wrote:
>>
>>> Jim Gallacher wrote:
>>>
>>>
>>>> Alexis Marrero wrote:
>>>>
>>>>
>>>>> Jim,
>>>>> Thanks for sending the function that creates the test file.   
>>>>> However I ran it to create the test file, and after uploading  
>>>>> the  file the MD5 still the same.
>>>>>
>>>>>
>>>
>>> Just to clarify, is this for your new read_to_boundary or the  
>>> one  in 3.2? If it's for yours then the MD5 sum *should* be the  
>>> same,  since that's what you fixed. :)
>>>
>>>
>>>
>>>> Did you call it with the same block size as you are using in  
>>>> your  code? The '\r' character must appear in the file right at  
>>>> the  readBlockSize boundary.
>>>> ie.
>>>> generate_file(offset=-1, readBlockSize=1<<16, fname='testfile')
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>
> #!/usr/bin/env python
>
> from mkfile import generate_split_file, generate_file
> import sys
> from StringIO import StringIO
> import md5
>
> def read_to_boundary_current(self, req, boundary, file,  
> readBlockSize):
>     ''' currrent version '''
>     #
>     # Although technically possible for the boundary to be split by  
> the read, this will
>     # not happen because the readBlockSize is set quite high - far  
> longer than any boundary line
>     # will ever contain.
>     #
>     # lastCharCarried is used to detect the situation where the \r 
> \n is split across the end of
>     # a read block.
>     #
>     delim = ''
>     lastCharCarried = False
>     last_bound = boundary + '--'
>     roughBoundaryLength = len(last_bound) + 128
>     line = req.readline(readBlockSize)
>     lineLength = len(line)
>     if lineLength < roughBoundaryLength:
>         sline = line.strip()
>     else:
>         sline = ''
>     while lineLength > 0 and sline != boundary and sline !=  
> last_bound:
>         if not lastCharCarried:
>             file.write(delim)
>             delim = ''
>         else:
>             lastCharCarried = False
>         cutLength = 0
>
>         if lineLength == readBlockSize:
>             if line[-1:] == '\r':
>                 delim = '\r'
>                 cutLength = -1
>                 lastCharCarried = True
>
>         if line[-2:] == '\r\n':
>             delim += '\r\n'
>             cutLength = -2
>         elif line[-1:] == '\n':
>             delim += '\n'
>             cutLength = -1
>         if cutLength != 0:
>             file.write(line[:cutLength])
>         else:
>             file.write(line)
>
>         line = req.readline(readBlockSize)
>         lineLength = len(line)
>         if lineLength < roughBoundaryLength:
>             sline = line.strip()
>         else:
>             sline = ''
>
> def read_to_boundary_new(self, req, boundary, file, readBlockSize):
>     ''' Alexis' version
>         read from the request object line by line with a maximum size,
>         until the new line starts with boundary
>     '''
>     previous_delimiter = ''
>     while 1:
>         line = req.readline(readBlockSize)
>         if line.startswith(boundary):
>             break
>
>         if line.endswith('\r\n'):
>             file.write(previous_delimiter + line[:-2])
>             previous_delimiter = '\r\n'
>
>         elif line.endswith('\r') or line.endswith('\n'):
>             file.write(previous_delimiter + line[:-1])
>             previous_delimiter = line[-1:]
>
>         else:
>             file.write(previous_delimiter + line)
>             previous_delimiter = ''
>
> def get_checksum(fname):
>     data = open(fname).read()
>     m = md5.new()
>     m.update(data)
>     return m.hexdigest()
>
> def generate_embedded_cr_file(offset=-1, readBlockSize=65368,  
> chunk='\r', fname='testfile'):
>     """ Generate a file which causes the error with file upload
>         The default offset of -1 should generate a file which will
>         be corrupted by the file upload.
>     """
>
>     f = open(fname, 'w')
>     f.write('a'*50)
>     f.write('\r\n')
>
>     block_size =  readBlockSize + offset
>     f.write('b'*block_size)
>     f.write(chunk)
>     f.write('ccc')
>
>     f.write('d'*50)
>     f.write('\r\n')
>
>     f.close()
>
> def generate_split_boundary_file(offset=-1, readBlockSize=65368,  
> chunk='\r', fname='testfile'):
>     """ this function generates a file with a boundary string '\r 
> \n--myboundary'
>         starting at readBlockSize - offset
>     """
>     f = open(fname, 'w')
>     f.write('a'*50)
>     f.write('\r\n')
>
>     block_size =  readBlockSize + offset
>     f.write('b'*block_size)
>     f.write(chunk)
>
>     f.close()
>
> def main(file_generator, offset, chunk, block_size=1<<16):
>     fname_in = 'testfile.in'
>     fname_out = 'testfile.out'
>     file_generator(offset=offset, readBlockSize=block_size,  
> chunk=chunk, fname=fname_in)
>     orig_checksum = get_checksum(fname_in)
>
>     req = StringIO()
>     req.write(open(fname_in).read())
>     req.write('\r\n--myboundary\r\n')
>
>     src_cs = get_checksum(fname_in)
>     print 'src', src_cs
>
>     fname_out = '%s.cur' % fname_out
>     o = file(fname_out, 'wb')
>     req.seek(0)
>     read_to_boundary_current(None, req, '--myboundary', o, block_size)
>     o.close()
>     cs = get_checksum(fname_out)
>     print 'cur', cs,
>     if cs != src_cs:
>         print 'FAIL'
>     else:
>         print 'PASS'
>
>     fname_out = '%s.alexis' % fname_out
>     o = file(fname_out, 'wb')
>     req.seek(0)
>     read_to_boundary_new(None, req, '--myboundary', o, block_size)
>     o.close()
>     cs = get_checksum(fname_out)
>     print 'new', cs,
>     if cs != src_cs:
>         print 'FAIL'
>     else:
>         print 'PASS'
>
> def cname(ch):
>     if ch == '\r':
>         return 'CR'
>     elif ch == '\n':
>         return 'LF'
>     elif ch == '':
>         return 'None'
>     else:
>         return ord(ch)
>
> if __name__ == '__main__':
>
>     #test_chunks =  ['', '\r', '\n', '\r\n']
>
>     # only test the chunks that are currently a problem
>     test_chunks =  ['', '\r',]
>
>     test_cases =  
> {'generate_embedded_cr_file':generate_embedded_cr_file,  
> 'generate_split_boundary_file': generate_split_boundary_file, }
>     for name,file_gen_obj in test_cases.items():
>         print '='*40
>         print name
>         for chunk in test_chunks:
>             for i in range(-1, 0):
>                 print '-'*40
>                 print 'test offset', i, 'chunk',[ cname(c) for c in  
> chunk ]
>                 print '-'*40
>                 main(file_gen_obj, i, chunk)
>                 print
>             print
>

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Jim Gallacher <jp...@jgassociates.ca>.

Alexis Marrero wrote:
> Ok. Now I'm confused.

So am I!

I've created a test harness so we can bypass mod_python completely. It 
includes a slightly modified version of read_to_boundary which adds a 
new parameter, readBlockSize.

In the output from the test harness, your version is 'new' and the 
current version is 'cur'. Run it and see what happens.

Jim

$ ./upload_test_harness

========================================
generate_embedded_cr_file
----------------------------------------
test offset -1 chunk []
----------------------------------------
src 5a63347d1106afdfa264b2a61f81ae82
cur 5a63347d1106afdfa264b2a61f81ae82 PASS
new 5a63347d1106afdfa264b2a61f81ae82 PASS

----------------------------------------
test offset -1 chunk ['CR']
----------------------------------------
src 82204e52343d5b25c2e783cd59499973
cur e4af2eee73029642a114697ba59217b3 FAIL
new 82204e52343d5b25c2e783cd59499973 PASS

========================================
generate_split_boundary_file
----------------------------------------
test offset -1 chunk []
----------------------------------------
src d481990a0f0bbd8acf847cd732714555
cur d481990a0f0bbd8acf847cd732714555 PASS
new 8fa5ac9f913d778575ea871506c392a9 FAIL

----------------------------------------
test offset -1 chunk ['CR']
----------------------------------------
src 8fa5ac9f913d778575ea871506c392a9
cur d481990a0f0bbd8acf847cd732714555 FAIL
new 8fa5ac9f913d778575ea871506c392a9 PASS


> What I was trying to say is that I created a file with this function:
> 
> def generate_split_file(offset=-1,
>                           readBlockSize=65368,
>                           fname='testfile'):
>     f = open(fname, 'w')
>     f.write('a'*50)
>     f.write('\r\n')
>     block_size =  readBlockSize + offset
>     f.write('b'*block_size)
>     f.close()
> 
> Then I uploaded 'testfile' using the following  
> StorageField.read_to_boundary() method:
> 
> def read_to_boundary(self, req, boundary, file):
>     ''' read from the request object line by line with a maximum size,
>         until the new line starts with boundary
>     '''
>     previous_delimiter = ''
>     while 1:
>         line = req.readline(1<<16)
>         if line.startswith(boundary):
>             break
> 
>         if line.endswith('\r\n'):
>             file.write(previous_delimiter + line[:-2])
>             previous_delimiter = '\r\n'
> 
>         elif line.endswith('\r') or line.endswith('\n'):
>             file.write(previous_delimiter + line[:-1])
>             previous_delimiter = line[-1:]
> 
>         else:
>             file.write(previous_delimiter + line)
>             previous_delimiter = ''
> 
> And the md5 in the client is the same one as in the server.  Do you  
> have different results? Let me know.
> 
> Regards,
> /amn
> 
> On Nov 7, 2005, at 2:11 PM, Jim Gallacher wrote:
> 
>> Jim Gallacher wrote:
>>
>>> Alexis Marrero wrote:
>>>
>>>> Jim,
>>>> Thanks for sending the function that creates the test file.  However 
>>>> I ran it to create the test file, and after uploading the  file the 
>>>> MD5 still the same.
>>>>
>>
>> Just to clarify, is this for your new read_to_boundary or the one  in 
>> 3.2? If it's for yours then the MD5 sum *should* be the same,  since 
>> that's what you fixed. :)
>>
>>
>>> Did you call it with the same block size as you are using in your  
>>> code? The '\r' character must appear in the file right at the  
>>> readBlockSize boundary.
>>> ie.
>>> generate_file(offset=-1, readBlockSize=1<<16, fname='testfile')
>>>
>>
>>
>>
>>
> 
>

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Alexis Marrero <am...@mitre.org>.

Ok. Now I'm confused.

What I was trying to say is that I created a file with this function:

def generate_split_file(offset=-1,
                           readBlockSize=65368,
                           fname='testfile'):
     f = open(fname, 'w')
     f.write('a'*50)
     f.write('\r\n')
     block_size =  readBlockSize + offset
     f.write('b'*block_size)
     f.close()

Then I uploaded 'testfile' using the following  
StorageField.read_to_boundary() method:

def read_to_boundary(self, req, boundary, file):
     ''' read from the request object line by line with a maximum size,
         until the new line starts with boundary
     '''
     previous_delimiter = ''
     while 1:
         line = req.readline(1<<16)
         if line.startswith(boundary):
             break

         if line.endswith('\r\n'):
             file.write(previous_delimiter + line[:-2])
             previous_delimiter = '\r\n'

         elif line.endswith('\r') or line.endswith('\n'):
             file.write(previous_delimiter + line[:-1])
             previous_delimiter = line[-1:]

         else:
             file.write(previous_delimiter + line)
             previous_delimiter = ''

And the md5 in the client is the same one as in the server.  Do you  
have different results? Let me know.

Regards,
/amn

On Nov 7, 2005, at 2:11 PM, Jim Gallacher wrote:

> Jim Gallacher wrote:
>
>> Alexis Marrero wrote:
>>
>>> Jim,
>>> Thanks for sending the function that creates the test file.  
>>> However I ran it to create the test file, and after uploading the  
>>> file the MD5 still the same.
>>>
>
> Just to clarify, is this for your new read_to_boundary or the one  
> in 3.2? If it's for yours then the MD5 sum *should* be the same,  
> since that's what you fixed. :)
>
>
>> Did you call it with the same block size as you are using in your  
>> code? The '\r' character must appear in the file right at the  
>> readBlockSize boundary.
>> ie.
>> generate_file(offset=-1, readBlockSize=1<<16, fname='testfile')
>>
>
>
>
>

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Jim Gallacher <jp...@jgassociates.ca>.

Jim Gallacher wrote:
> Alexis Marrero wrote:
> 
>> Jim,
>> Thanks for sending the function that creates the test file. However I 
>> ran it to create the test file, and after uploading the file the MD5 
>> still the same.

Just to clarify, is this for your new read_to_boundary or the one in 
3.2? If it's for yours then the MD5 sum *should* be the same, since 
that's what you fixed. :)

> Did you call it with the same block size as you are using in your code? 
> The '\r' character must appear in the file right at the readBlockSize 
> boundary.
> 
> ie.
> 
> generate_file(offset=-1, readBlockSize=1<<16, fname='testfile')
>

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Jim Gallacher <jp...@jgassociates.ca>.

Alexis Marrero wrote:
> Sorry for all this emails, 

No worries. It's a bug that needs to be fixed, so your work will benefit 
everyone. :)

Jim

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Alexis Marrero <am...@mitre.org>.

Sorry for all this emails, but my system depends 100% on mod_python  
specially file uploading. :)

On Nov 7, 2005, at 2:04 PM, Jim Gallacher wrote:

> Alexis Marrero wrote:
>
>> Jim,
>> Nicolas,
>> Thanks for sending the function that creates the test file.  
>> However I ran it to create the test file, and after uploading the  
>> file the MD5 still the same.
>>
>
> Did you call it with the same block size as you are using in your  
> code? The '\r' character must appear in the file right at the  
> readBlockSize boundary.
YES I ran it with generate_file(offset=-1, readBlockSize=1<<16,  
fname='testfile')
and the MD5 of the input and output files match.

>
> ie.
>
> generate_file(offset=-1, readBlockSize=1<<16, fname='testfile')
>
> Jim
>
> PS. Please make sure you cc. python-dev when replying. It's best to  
> keep discussion on the list.
>
>
>

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Jim Gallacher <jp...@jgassociates.ca>.

Alexis Marrero wrote:
> Jim,
> 
> Nicolas,
> 
> Thanks for sending the function that creates the test file. However I 
> ran it to create the test file, and after uploading the file the MD5 
> still the same.

Did you call it with the same block size as you are using in your code? 
The '\r' character must appear in the file right at the readBlockSize 
boundary.

ie.

generate_file(offset=-1, readBlockSize=1<<16, fname='testfile')

Jim

PS. Please make sure you cc. python-dev when replying. It's best to keep 
discussion on the list.

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Jim Gallacher <jp...@jgassociates.ca>.

Nicolas Lehuen wrote:
> Well, I've re-read the previous code and it looks like it does almost 
> the same thing except it is bugged :). CherryPy's implementation is 
> almost the same except it ought to work.
> 
> Jim, I've integrated your tricky file into the unit test. Alexis' 
> version passes all tests, whereas the current version fails on your file 
> and ugh.pdf. So I guess we can call it a day and integrate Alexis' fix.
> 
> One thing which is different is the use of 1<<16 = 65536 as read block 
> size instead of 65368. It was Barry Pearce who contributed the latest 
> version of read_to_boundary, he must have known why this block size was 
> good (or was it his lucky number ?). Anyway, I've changed Alexis' code 
> to use the readBlockSize variable and changed its value to 1<<16. I've 
> done tests with both 1<<16 and 65638 on the server side and on the 
> client side (changing the tricky file generation), and the new code works.

There is a failure mode in Alexis' code. If the final line in the file 
has length == readBlockSize - 1, the boundary (eg. '\r\n--myboundary') 
gets split after the '\r'. The read terminates properly, but a stray 
'\r' gets appended to the file. Interestingly, the original code 
mentions that this is unlikely to happen, but seems to handle it anyway.

This bit of code will generate a file which will cause a problem for 
Alexis' read_to_boundary.

def generate_split_file(offset=-1,
                           readBlockSize=65368,
                           fname='testfile'):
     f = open(fname, 'w')
     f.write('a'*50)
     f.write('\r\n')
     block_size =  readBlockSize + offset
     f.write('b'*block_size)
     f.close()

I'll write a new unit test to cover this situation.

I'm also thinking that the unit test should not hard code the 
readBlockSize right in the test. It would be better to have a global 
variable at the top of test.py, with a note in both test.py and util.py 
that it's value must be the same in both files. Otherwise, if the value 
of readBlockSize changes in util.py, we may end up with regressions that 
the unit test will not catch.

As far as Nicolas' test_fileupload unit test is concerned I'd be 
inclined to remove the test for ugh.pdf. I'm pretty confident that the 
simulated 'tricky' file is an adequate test, plus we can't be sure that 
ugh.pdf will not change in the future.

Alexis mentioned that he had 5 pdf files that were corrupted. If the 
other files are publicly accessible I could take a look and see if they 
are failing the same way, with a '\r' character right at a readBlockSize 
boundary point. Alternatively, I could polish my test script and he can 
run it against his files. If we do that then I'm confident we can 
generate a synthetic file for testing and not depend on ugh.pdf

Jim

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Nicolas Lehuen <ni...@gmail.com>.

Well, I've re-read the previous code and it looks like it does almost the
same thing except it is bugged :). CherryPy's implementation is almost the
same except it ought to work.

Jim, I've integrated your tricky file into the unit test. Alexis' version
passes all tests, whereas the current version fails on your file and ugh.pdf.
So I guess we can call it a day and integrate Alexis' fix.

One thing which is different is the use of 1<<16 = 65536 as read block size
instead of 65368. It was Barry Pearce who contributed the latest version of
read_to_boundary, he must have known why this block size was good (or was it
his lucky number ?). Anyway, I've changed Alexis' code to use the
readBlockSize variable and changed its value to 1<<16. I've done tests with
both 1<<16 and 65638 on the server side and on the client side (changing the
tricky file generation), and the new code works.

What we should have done, while we were at spending a lot of time on a
trivial string search problem, is at least to implement a fast Boyer-Moore
search algorithm, and drop the use of readline. But I guess this will be for
3.4 ;).

And yes, this 3 spaces indent is dreadful.

Regards,
Nicolas

2005/11/7, Jim Gallacher <jpg@jgassociates.ca >:
>
> Gregory (Grisha) Trubetskoy wrote:
> >
> > So I guess this means we roll and vote on a 3.2.5b?
> >
>
> As much as it pains me to say it, but yes, this is a must fixm so it's
> on to 3.2.5b.
>
> I think we need to do some more extensive testing on Alexis's fix before
> we roll 3.2.5b. His read_to_boundary is much simpler than the current
> one. This makes me wonder if there is some magic happening in the
> current version which is solving some weird problems, or is his code
> just that much better?
>
> I'm feeling a little dull right now so all the code just blurs together.
> It also doesn't help that util.py uses *3 spaces* for the indent. Yikes.
> How the heck did that happen? :(
>
> I'll take another look tomorrow.
>
> Jim
>

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Jim Gallacher <jp...@jgassociates.ca>.

Gregory (Grisha) Trubetskoy wrote:
> 
> So I guess this means we roll and vote on a 3.2.5b?
> 

As much as it pains me to say it, but yes, this is a must fixm so it's 
on to 3.2.5b.

I think we need to do some more extensive testing on Alexis's fix before 
we roll 3.2.5b. His read_to_boundary is much simpler than the current 
one. This makes me wonder if there is some magic happening in the 
current version which is solving some weird problems, or is his code 
just that much better?

I'm feeling a little dull right now so all the code just blurs together. 
It also doesn't help that util.py uses *3 spaces* for the indent. Yikes. 
How the heck did that happen? :(

I'll take another look tomorrow.

Jim

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by "Gregory (Grisha) Trubetskoy" <gr...@apache.org>.

So I guess this means we roll and vote on a 3.2.5b?

Grisah

On Sun, 6 Nov 2005, Nicolas Lehuen wrote:

> OK, it looks like Alexis' fix solves the problem with ugh.pdf without
> breaking the other unit tests. So I think we can safely integrate his patch.
> Shall I do it ?
>
> Regards,
> Nicolas

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Alexis Marrero <am...@mitre.org>.

Nicolas,

Not that I'm the one to give permission whether to integrate things  
or not, but just to let you know I don't even have svn installed so I  
won't do it. At least not for a while...

BTW, if there are some cherrypy developers in this mailing list, the  
CherryPy function that handles file uploads DOES also has the same  
issue.

I'm not subscribe to CherryPy dev group thus the cross posting.

 From http://svn.cherrypy.org/trunk/cherrypy/_cpcgifs.py

     24      def read_lines_to_outerboundary(self):
     25          """Internal: read lines until outerboundary."""
     26          next = "--" + self.outerboundary
     27          last = next + "--"
     28          delim = ""
     29          last_line_lfend = True
     30          while 1:
     31              line = self.fp.readline(1<<16)
     32              if not line:
     33                  self.done = -1
     34                  break
     35              if line[:2] == "--" and last_line_lfend:
     36                  strippedline = line.strip()
     37                  if strippedline == next:
     38                      break
     39                  if strippedline == last:
     40                      self.done = 1
     41                      break
     42              odelim = delim
     43              if line[-2:] == "\r\n":
     44                  delim = "\r\n"
     45                  line = line[:-2]
     46                  last_line_lfend = True
     47              elif line[-1] == "\n":
     48                  delim = "\n"
     49                  line = line[:-1]
     50                  last_line_lfend = True
     51              else:
     52                  delim = ""
     53                  last_line_lfend = False
     54              self.__write(odelim + line)

/amn

On Nov 6, 2005, at 4:20 PM, Nicolas Lehuen wrote:

> OK, it looks like Alexis' fix solves the problem with ugh.pdf  
> without breaking the other unit tests. So I think we can safely  
> integrate his patch. Shall I do it ?
>
> Regards,
> Nicolas
>
> 2005/11/6, Nicolas Lehuen <ni...@gmail.com>:
> Hi guys,
>
> In the pure "if it ain't tested, it ain't fixed" fashion, I've  
> added a unit test for file upload to the test suite. It uploads a  
> randomly generated 1 MB file to the server, and check that the MD5  
> digest returned by the server is correct. I could not reproduce  
> Alexis' bug report this way, however. But I then I added a test  
> with the UNIX-HATERS handbook file ugh.pdf, and bang, here comes  
> the bug.
>
> I've checked in both unit tests into subversion, so that you can  
> play with them. I'm now going to test Alexis' fix.
>
> Regards,
> Nicolas
>
> 2005/11/6, Alexis Marrero <am...@mitre.org>:
> I don't have a function that creates the files but the I can point
> you to a file that has the problem, ironically is "Unix Haters
> Handbook" :) Well, at least is not the Python HH....
>
> http://research.microsoft.com/~daniel/uhh-download.html
>
> It's MD5 is 9e8c42be55aac825e7a34d448044d0fe. I don't know what it
> ends up been after upload with read_to_boundary().
>
> When you use the function to copy the file you will see that the
> digest will be e45979254297b0ece9c182a789d7966e.
>
> I have other 5 files out of 78546 files that I'm testing it against
> that have the same issues, coincidentally there are all PDF files.
> Here is the script that I was testing it with.
>
> def read_to_boundary(self, req, boundary, file):
>      ''' read from the request object line by line with a maximum  
> size,
>          until the new line starts with boundary
>      '''
>      previous_delimiter = ''
>      while 1:
>          line = req.readline(1<<16)
>          if line.startswith(boundary):
>              break
>
>          if line.endswith('\r\n'):
>              file.write(previous_delimiter + line[:-2])
>              previous_delimiter = '\r\n'
>
>          elif line.endswith('\r') or line.endswith('\n'):
>              file.write(previous_delimiter + line[:-1])
>              previous_delimiter = line[-1:]
>
>          else:
>              file.write(previous_delimiter + line)
>              previous_delimiter = ''
>
> #f = file('Perl Bookshelf [4th Ed]/mre/final/ch06.pdf.new', 'a+')
> #f = file('Pages User Guide.app/Contents/Resources/Italian.lproj/
> Pages Manuale Utente.pdf', 'a+')
> f = file('ugh.pdf.new', 'a+')
> f.write('\r\n--myboundary--\r\n')
> f.seek(0)
> o = file('test.bin', 'wb')
> read_to_boundary(None, f, '--myboundary', o)
> o.close()
>
> /amn
>
>
> On Nov 6, 2005, at 11:58 AM, Jim Gallacher wrote:
>
> > Alexis,
> >
> > I wanted to add that I'm testing your code.
> >
> > Alexis Marrero wrote:
> >
> >> Let me know any comments on it and if you test it and fails please
> >> also let me know. I don't have subversion account neither I don't
> >> know how to use it thus this email.
> >>
> >
> > You don't need an account to use subversion anonymously. Just
> > install subversion and grab a mod_python working copy.
> >
> > $ svn co http://svn.apache.org/repos/asf/httpd/mod_python/trunk  
> trunk
> >
> > This will checkout a working copy into a new directory called
> > "trunk" on  your machine. All of the following commands assume you
> > are working in trunk/.
> >
> > Make your changes in your working copy, and then create a diff with:
> >
> > $ svn diff lib/python/mod_python/util.py > your-patch.diff
> >
> > The other commands which you'll find immediately useful are:
> >
> > svn update    - update your working copy from the repository
> > svn status    - shows status of changes in your working copy
> > svn -u status    - shows status of your copy against the repository
> >
> > I've found "Version Control with Subverion" is an excellent
> > resource and is available online.
> > http://svnbook.red-bean.com/en/1.1/index.html
> >
> > Jim
> >
>
>
>

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Nicolas Lehuen <ni...@gmail.com>.

OK, it looks like Alexis' fix solves the problem with ugh.pdf without
breaking the other unit tests. So I think we can safely integrate his patch.
Shall I do it ?

Regards,
Nicolas

2005/11/6, Nicolas Lehuen <ni...@gmail.com>:
>
> Hi guys,
>
> In the pure "if it ain't tested, it ain't fixed" fashion, I've added a
> unit test for file upload to the test suite. It uploads a randomly generated
> 1 MB file to the server, and check that the MD5 digest returned by the
> server is correct. I could not reproduce Alexis' bug report this way,
> however. But I then I added a test with the UNIX-HATERS handbook file
> ugh.pdf, and bang, here comes the bug.
>
> I've checked in both unit tests into subversion, so that you can play with
> them. I'm now going to test Alexis' fix.
>
> Regards,
> Nicolas
>
> 2005/11/6, Alexis Marrero <am...@mitre.org>:
> >
> > I don't have a function that creates the files but the I can point
> > you to a file that has the problem, ironically is "Unix Haters
> > Handbook" :) Well, at least is not the Python HH....
> >
> > http://research.microsoft.com/~daniel/uhh-download.html<http://research.microsoft.com/%7Edaniel/uhh-download.html>
> >
> > It's MD5 is 9e8c42be55aac825e7a34d448044d0fe. I don't know what it
> > ends up been after upload with read_to_boundary().
> >
> > When you use the function to copy the file you will see that the
> > digest will be e45979254297b0ece9c182a789d7966e.
> >
> > I have other 5 files out of 78546 files that I'm testing it against
> > that have the same issues, coincidentally there are all PDF files.
> > Here is the script that I was testing it with.
> >
> > def read_to_boundary(self, req, boundary, file):
> > ''' read from the request object line by line with a maximum size,
> > until the new line starts with boundary
> > '''
> > previous_delimiter = ''
> > while 1:
> > line = req.readline(1<<16)
> > if line.startswith(boundary):
> > break
> >
> > if line.endswith('\r\n'):
> > file.write(previous_delimiter + line[:-2])
> > previous_delimiter = '\r\n'
> >
> > elif line.endswith('\r') or line.endswith('\n'):
> > file.write(previous_delimiter + line[:-1])
> > previous_delimiter = line[-1:]
> >
> > else:
> > file.write(previous_delimiter + line)
> > previous_delimiter = ''
> >
> > #f = file('Perl Bookshelf [4th Ed]/mre/final/ch06.pdf.new', 'a+')
> > #f = file('Pages User Guide.app/Contents/Resources/Italian.lproj/
> > Pages Manuale Utente.pdf', 'a+')
> > f = file('ugh.pdf.new', 'a+')
> > f.write('\r\n--myboundary--\r\n')
> > f.seek(0)
> > o = file('test.bin', 'wb')
> > read_to_boundary(None, f, '--myboundary', o)
> > o.close()
> >
> > /amn
> >
> >
> > On Nov 6, 2005, at 11:58 AM, Jim Gallacher wrote:
> >
> > > Alexis,
> > >
> > > I wanted to add that I'm testing your code.
> > >
> > > Alexis Marrero wrote:
> > >
> > >> Let me know any comments on it and if you test it and fails please
> > >> also let me know. I don't have subversion account neither I don't
> > >> know how to use it thus this email.
> > >>
> > >
> > > You don't need an account to use subversion anonymously. Just
> > > install subversion and grab a mod_python working copy.
> > >
> > > $ svn co http://svn.apache.org/repos/asf/httpd/mod_python/trunk trunk
> > >
> > > This will checkout a working copy into a new directory called
> > > "trunk" on your machine. All of the following commands assume you
> > > are working in trunk/.
> > >
> > > Make your changes in your working copy, and then create a diff with:
> > >
> > > $ svn diff lib/python/mod_python/util.py > your-patch.diff
> > >
> > > The other commands which you'll find immediately useful are:
> > >
> > > svn update - update your working copy from the repository
> > > svn status - shows status of changes in your working copy
> > > svn -u status - shows status of your copy against the repository
> > >
> > > I've found "Version Control with Subverion" is an excellent
> > > resource and is available online.
> > > http://svnbook.red-bean.com/en/1.1/index.html
> > >
> > > Jim
> > >
> >
> >
>

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Nicolas Lehuen <ni...@gmail.com>.

Hi guys,

In the pure "if it ain't tested, it ain't fixed" fashion, I've added a unit
test for file upload to the test suite. It uploads a randomly generated 1 MB
file to the server, and check that the MD5 digest returned by the server is
correct. I could not reproduce Alexis' bug report this way, however. But I
then I added a test with the UNIX-HATERS handbook file ugh.pdf, and bang,
here comes the bug.

I've checked in both unit tests into subversion, so that you can play with
them. I'm now going to test Alexis' fix.

Regards,
Nicolas

2005/11/6, Alexis Marrero <am...@mitre.org>:
>
> I don't have a function that creates the files but the I can point
> you to a file that has the problem, ironically is "Unix Haters
> Handbook" :) Well, at least is not the Python HH....
>
> http://research.microsoft.com/~daniel/uhh-download.html
>
> It's MD5 is 9e8c42be55aac825e7a34d448044d0fe. I don't know what it
> ends up been after upload with read_to_boundary().
>
> When you use the function to copy the file you will see that the
> digest will be e45979254297b0ece9c182a789d7966e.
>
> I have other 5 files out of 78546 files that I'm testing it against
> that have the same issues, coincidentally there are all PDF files.
> Here is the script that I was testing it with.
>
> def read_to_boundary(self, req, boundary, file):
> ''' read from the request object line by line with a maximum size,
> until the new line starts with boundary
> '''
> previous_delimiter = ''
> while 1:
> line = req.readline(1<<16)
> if line.startswith(boundary):
> break
>
> if line.endswith('\r\n'):
> file.write(previous_delimiter + line[:-2])
> previous_delimiter = '\r\n'
>
> elif line.endswith('\r') or line.endswith('\n'):
> file.write(previous_delimiter + line[:-1])
> previous_delimiter = line[-1:]
>
> else:
> file.write(previous_delimiter + line)
> previous_delimiter = ''
>
> #f = file('Perl Bookshelf [4th Ed]/mre/final/ch06.pdf.new', 'a+')
> #f = file('Pages User Guide.app/Contents/Resources/Italian.lproj/
> Pages Manuale Utente.pdf', 'a+')
> f = file('ugh.pdf.new', 'a+')
> f.write('\r\n--myboundary--\r\n')
> f.seek(0)
> o = file('test.bin', 'wb')
> read_to_boundary(None, f, '--myboundary', o)
> o.close()
>
> /amn
>
>
> On Nov 6, 2005, at 11:58 AM, Jim Gallacher wrote:
>
> > Alexis,
> >
> > I wanted to add that I'm testing your code.
> >
> > Alexis Marrero wrote:
> >
> >> Let me know any comments on it and if you test it and fails please
> >> also let me know. I don't have subversion account neither I don't
> >> know how to use it thus this email.
> >>
> >
> > You don't need an account to use subversion anonymously. Just
> > install subversion and grab a mod_python working copy.
> >
> > $ svn co http://svn.apache.org/repos/asf/httpd/mod_python/trunk trunk
> >
> > This will checkout a working copy into a new directory called
> > "trunk" on your machine. All of the following commands assume you
> > are working in trunk/.
> >
> > Make your changes in your working copy, and then create a diff with:
> >
> > $ svn diff lib/python/mod_python/util.py > your-patch.diff
> >
> > The other commands which you'll find immediately useful are:
> >
> > svn update - update your working copy from the repository
> > svn status - shows status of changes in your working copy
> > svn -u status - shows status of your copy against the repository
> >
> > I've found "Version Control with Subverion" is an excellent
> > resource and is available online.
> > http://svnbook.red-bean.com/en/1.1/index.html
> >
> > Jim
> >
>
>

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Alexis Marrero <am...@mitre.org>.

I don't have a function that creates the files but the I can point  
you to a file that has the problem, ironically is "Unix Haters  
Handbook" :) Well, at least is not the Python HH....

http://research.microsoft.com/~daniel/uhh-download.html

It's MD5 is 9e8c42be55aac825e7a34d448044d0fe. I don't know what it  
ends up been after upload with read_to_boundary().

When you use the function to copy the file you will see that the  
digest will be e45979254297b0ece9c182a789d7966e.

I have other 5 files out of 78546 files that I'm testing it against  
that have the same issues, coincidentally there are all PDF files.  
Here is the script that I was testing it with.

def read_to_boundary(self, req, boundary, file):
     ''' read from the request object line by line with a maximum size,
         until the new line starts with boundary
     '''
     previous_delimiter = ''
     while 1:
         line = req.readline(1<<16)
         if line.startswith(boundary):
             break

         if line.endswith('\r\n'):
             file.write(previous_delimiter + line[:-2])
             previous_delimiter = '\r\n'

         elif line.endswith('\r') or line.endswith('\n'):
             file.write(previous_delimiter + line[:-1])
             previous_delimiter = line[-1:]

         else:
             file.write(previous_delimiter + line)
             previous_delimiter = ''

#f = file('Perl Bookshelf [4th Ed]/mre/final/ch06.pdf.new', 'a+')
#f = file('Pages User Guide.app/Contents/Resources/Italian.lproj/ 
Pages Manuale Utente.pdf', 'a+')
f = file('ugh.pdf.new', 'a+')
f.write('\r\n--myboundary--\r\n')
f.seek(0)
o = file('test.bin', 'wb')
read_to_boundary(None, f, '--myboundary', o)
o.close()

/amn


On Nov 6, 2005, at 11:58 AM, Jim Gallacher wrote:

> Alexis,
>
> I wanted to add that I'm testing your code.
>
> Alexis Marrero wrote:
>
>> Let me know any comments on it and if you test it and fails please  
>> also let me know. I don't have subversion account neither I don't  
>> know how to use it thus this email.
>>
>
> You don't need an account to use subversion anonymously. Just  
> install subversion and grab a mod_python working copy.
>
> $ svn co http://svn.apache.org/repos/asf/httpd/mod_python/trunk trunk
>
> This will checkout a working copy into a new directory called  
> "trunk" on  your machine. All of the following commands assume you  
> are working in trunk/.
>
> Make your changes in your working copy, and then create a diff with:
>
> $ svn diff lib/python/mod_python/util.py > your-patch.diff
>
> The other commands which you'll find immediately useful are:
>
> svn update    - update your working copy from the repository
> svn status    - shows status of changes in your working copy
> svn -u status    - shows status of your copy against the repository
>
> I've found "Version Control with Subverion" is an excellent  
> resource and is available online.
> http://svnbook.red-bean.com/en/1.1/index.html
>
> Jim
>

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Jim Gallacher <jp...@jgassociates.ca>.

Alexis,

I wanted to add that I'm testing your code.

Alexis Marrero wrote:
> 
> Let me know any comments on it and if you test it and fails please also 
> let me know. I don't have subversion account neither I don't know how to 
> use it thus this email.

You don't need an account to use subversion anonymously. Just install 
subversion and grab a mod_python working copy.

$ svn co http://svn.apache.org/repos/asf/httpd/mod_python/trunk trunk

This will checkout a working copy into a new directory called "trunk" on 
  your machine. All of the following commands assume you are working in 
trunk/.

Make your changes in your working copy, and then create a diff with:

$ svn diff lib/python/mod_python/util.py > your-patch.diff

The other commands which you'll find immediately useful are:

svn update	- update your working copy from the repository
svn status	- shows status of changes in your working copy
svn -u status	- shows status of your copy against the repository

I've found "Version Control with Subverion" is an excellent resource and 
is available online.
http://svnbook.red-bean.com/en/1.1/index.html

Jim

Re: mod_python.util.StorageField.read_to_boundary has problems in 3.1 and 3.2

Posted by Jim Gallacher <jp...@jgassociates.ca>.

Alexis,

Do you a have a small file which shows this behaviour and could be used 
for testing? Even better would be a function which would generate a test 
file. This could be included in the mod_python unit tests.

Jim


Alexis Marrero wrote:
> All, 
> 
> The current 3.1 mod_python implementation of 
> mod_python.util.StorageField.read_to_boudary reads as follows:
> 
>    203      def read_to_boundary(self, req, boundary, file):
>    204          delim = ""
>    205          line = req.readline()
>    206          sline = line.strip()
>    207          last_bound = boundary + "--"
>    208          while line and sline != boundary and sline != last_bound:
>    209              odelim = delim
>    210              if line[-2:] == "\r\n":
>    211                  delim = "\r\n"
>    212                  line = line[:-2]
>    213              elif line[-1:] == "\n":
>    214                  delim = "\n"
>    215                  line = line[:-1]
>    216              file.write(odelim + line)
>    217              line = req.readline()
>    218              sline = line.strip()
> 
> As we have discussed previously: 
> http://www.modpython.org/pipermail/mod_python/2005-March/017754.html
> http://www.modpython.org/pipermail/mod_python/2005-March/017756.html
> http://www.modpython.org/pipermail/mod_python/2005-November/019460.html
> 
> This triggered couple of changes in mod_python 3.2 Beta which reads as 
> follows:
>     33  # Fixes memory error when upload large files such as 700+MB ISOs.
>     34  readBlockSize = 65368
>     35
> *...*
>    225     def read_to_boundary(self, req, boundary, file):
> ...
>    234         delim = ''
>    235         lastCharCarried = False
>    236         last_bound = boundary + '--'
>    237         roughBoundaryLength = len(last_bound) + 128
>    238         line = req.readline(readBlockSize)
>    239         lineLength = len(line)
>    240         if lineLength < roughBoundaryLength:
>    241             sline = line.strip()
>    242         else:
>    243             sline = ''
>    244         while lineLength > 0 and sline != boundary and sline != 
> last_bound:
>    245             if not lastCharCarried:
>    246                 file.write(delim)
>    247                 delim = ''
>    248             else:
>    249                 lastCharCarried = False
>    250             cutLength = 0
>    251             if lineLength == readBlockSize:
>    252                 if line[-1:] == '\r':
>    253                     delim = '\r'
>    254                     cutLength = -1
>    255                     lastCharCarried = True
>    256             if line[-2:] == '\r\n':
>    257                 delim += '\r\n'
>    258                 cutLength = -2
>    259             elif line[-1:] == '\n':
>    260                 delim += '\n'
>    261                 cutLength = -1
>    262             if cutLength != 0:
>    263                 file.write(line[:cutLength])
>    264             else:
>    265                 file.write(line)
>    266             line = req.readline(readBlockSize)
>    267             lineLength = len(line)
>    268             if lineLength < roughBoundaryLength:
>    269                 sline = line.strip()
>    270             else:
>    271                 sline = ''
> 
> This function has a mysterious bug in it... For some files which I could 
> disclose (one of them been the PDF file for Apple's Pages User Manual in 
> Italian) the uploaded file in the server ends up with the same length 
> but different sha512 (the only digest that I'm using).  The problem is a 
> '\r' in the middle of a chunk of data that is much larger 
> than readBlockSize.
> 
> Anyhow, I wrote a new function, which I believe is much simpler, and 
> test it with thousands and thousands of different files and so far it 
> seems to work fine.  It reads as follows:
> 
> def read_to_boundary(self, req, boundary, file):
>     ''' read from the request object line by line with a maximum size,
>         until the new line starts with boundary
>     '''
>     previous_delimiter = ''
>     while 1:
>         line = req.readline(1<<16)
>         if line.startswith(boundary):
>             break
>        
>         if line.endswith('\r\n'):
>             file.write(previous_delimiter + line[:-2])
>             previous_delimiter = '\r\n'
>        
>         elif line.endswith('\r') or line.endswith('\n'):
>             file.write(previous_delimiter + line[:-1])  
>             previous_delimiter = line[-1:]
> 
>         else:
>             file.write(previous_delimiter + line)
>             previous_delimiter = ''
> 
> Let me know any comments on it and if you test it and fails please also 
> let me know. I don't have subversion account neither I don't know how to 
> use it thus this email.
> 
> /amn
> 
> _______________________________________________
> Mod_python mailing list
> Mod_python@modpython.org <ma...@modpython.org>
> http://mailman.modpython.org/mailman/listinfo/mod_python
> 
>