
Posted to mod_python-dev@quetz.apache.org by Nicolas Lehuen <ni...@gmail.com> on 2005/04/30 10:11:16 UTC

mod_python.publisher : proposal for a few implementation changes

Hi,

I'm trying to solve both MODPYTHON-15
(http://issues.apache.org/jira/browse/MODPYTHON-15) and MODPYTHON-16
(http://issues.apache.org/jira/browse/MODPYTHON-16) in one strike...

What do you think of this approach : the object returned by
resolve_object is then passed to this publish_object function :

# This regular expression is used to test for the presence of an HTML header
# tag, written in upper or lower case.
re_html = re.compile(r"<HTML",re.I)

def publish_object(req, object):
    if callable(object):
        req.form = util.FieldStorage(req, keep_blank_values=1)
        return publish_object(req,util.apply_fs_data(object, req.form, req=req))
    elif hasattr(object,'__iter__'):
        result = False
        for item in object:
            result |= publish_object(req,item)
        return result
    else:
        if object is None:
            return False
        elif isinstance(object,UnicodeType):
            # TODO : this skips all encoding issues, which is VERY BAD
            # I don't even understand how the req.write below can work !
            result = object
        else:
            result = str(object)
            
        if not req._content_type_set:
            # make an attempt to guess content-type
            if re_html.search(result,0,100):
                req.content_type = 'text/html'
            else:
                req.content_type = 'text/plain'
        
        if req.method!='HEAD':
            req.write(result)

        return True


This way, we could support classes, class instances and iterables
(amongst which generators) as possible return values. The boolean
return value tells the handler whether something was effectively
published ; the handler uses it as follows :

    if (not publish_object(req, object)) and (req.bytes_sent==0) and (req.next is None):
        req.log_error("mod_python.publisher: nothing to publish.")
        return apache.HTTP_INTERNAL_SERVER_ERROR
    else:
        return apache.OK
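
The boolean contract described above can be sketched outside of Apache. This
is only an illustration under stated assumptions: FakeRequest is a hypothetical
stand-in for the mod_python request object, callables are invoked with no
arguments instead of going through util.apply_fs_data, and content-type
guessing is left out. It is written for Python 3, where strings are themselves
iterable and so have to be tested before the iterable branch.

```python
class FakeRequest:
    """Hypothetical stand-in for the mod_python request object."""

    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)


def publish_object(req, obj):
    """Write obj to req; return True if anything was published."""
    if obj is None:
        return False
    if isinstance(obj, (str, bytes)):
        # Strings are iterable in Python 3, so handle them before iterables.
        req.write(obj)
        return True
    if callable(obj):
        # Simplification: the real code would pass form data to the callable.
        return publish_object(req, obj())
    if hasattr(obj, '__iter__'):
        published = False
        for item in obj:
            # Publish first, then combine, so every item is always written.
            published = publish_object(req, item) or published
        return published
    req.write(str(obj))
    return True


req = FakeRequest()
print(publish_object(req, None), req.chunks)
print(publish_object(req, ['a', None, ('b', 3)]), req.chunks)
```

A handler can then treat a False result, together with nothing having been
written, as the "nothing to publish" case.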

See the enclosed file for the whole thing. Comments are of course welcome.

Regards,

Nicolas

Re: mod_python.publisher : proposal for a few implementation changes

Posted by Graham Dumpleton <gr...@dscpl.com.au>.

A few issues, but I will tackle them one at a time in separate emails as I
get to them. :-)

On 30/04/2005, at 6:11 PM, Nicolas Lehuen wrote:

> # This regular expression is used to test for the presence of an HTML header
> # tag, written in upper or lower case.
> re_html = re.compile(r"<HTML",re.I)

This isn't a reliable way of determining if content is HTML. Previously the
test was:

             if result[:100].strip()[:6].lower() == '<html>' \
                or result.find('</') > 0:

You have dropped the requirement for the closing '>' on the 'html' element,
which is probably a good thing to do, but you also dropped the check for '</'
anywhere in the content. Dropping this latter check will cause content that
was detected as HTML before not to be detected as HTML now.

As an example, consider the start of the HTML from www.w3.org web site:

   <?xml version="1.0" encoding="utf-8"?>
   <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
   <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US"
       lang="en-US">
   ...
   </html>

The presence of the XML and DOCTYPE declarations puts the actual start of the
'html' tag at about character position 150. This is beyond the 100 characters
that the code checks for the start of an 'html' element.

This would also have failed the check as it existed in the original code, but
there the check for '</' anywhere in the content would then have kicked in and
would have been matched at some point.

I am not saying that the original code is any better, because of the potential
performance issues of scanning the full content in the worst case, but it did
work where the proposed code could fail.

What might be better is to search backwards through the final part (maybe 100
characters) of the content for the string '</html', rather than searching from
the start. This is because all the DOCTYPE declarations mean there can be a
lot of leading crap, whereas I don't think that anything can validly trail the
final closing of the 'html' element.
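
A minimal sketch of that tail search, for illustration only. The function
names and the 100-character window parameter are assumptions, not code from
the publisher, and the sketch uses plain Python 3 strings.

```python
def looks_like_html(content, window=100):
    """Return True if '</html' occurs in the last `window` characters."""
    tail = content[-window:].lower()
    # rfind scans from the end of the tail, where the closing tag is expected.
    return tail.rfind('</html') != -1


def guess_content_type(content):
    """Guess text/html or text/plain from the end of the content."""
    return 'text/html' if looks_like_html(content) else 'text/plain'


# A long XML/DOCTYPE prolog no longer matters, only the tail is inspected.
xhtml = ('<?xml version="1.0" encoding="utf-8"?>\n'
         '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
         '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
         '<html xmlns="http://www.w3.org/1999/xhtml">\n'
         '<body>' + 'x' * 500 + '</body>\n</html>\n')
print(guess_content_type(xhtml))
print(guess_content_type('just some text'))
```

Only the last `window` characters are lowercased and scanned, so the cost
stays constant however large the response is.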

Graham

Re: mod_python.publisher : proposal for a few implementation changes

Posted by Graham Dumpleton <gr...@dscpl.com.au>.

On 30/04/2005, at 6:37 PM, Nicolas Lehuen wrote:
>         if object is None:
>             return False

I really don't understand why a published function returning None should
result in an HTTP 500 internal server error response if no output content has
been explicitly written back via the request object. At least this is what
currently happens, and I am assuming that the above code continues that
behaviour.

In "vampire::publisher", if a published function returns None, or if access is
to a data variable which is set to None, I generate an empty string as a
response. It didn't make sense to me to be generating an HTTP 500 error; it
just seemed to confuse newbies who couldn't work out what they were doing
wrong. I figured they would learn quicker from getting an empty response
instead of a cryptic HTTP 500 error.

And yes, I am aware that you have since added logging of a message to
indicate:

   req.log_error("mod_python.publisher: %s returned nothing." % `object`)

Newbies however don't necessarily go looking in the log. ;-)

Graham

Re: mod_python.publisher : proposal for a few implementation changes

Posted by Graham Dumpleton <gr...@dscpl.com.au>.

On 30/04/2005, at 10:24 PM, Nicolas Lehuen wrote:
> So, all this relies on the default platform encoding. How nice. The
> reason why you don't find sys.setdefaultencoding() is because this
> method is deleted from the module after the module is loaded,
> presumably to prevent developers from changing the default encoding on the
> fly. I remember being mad at Python when I first discovered that (I
> was trying to remove this dumb 'ascii' default encoding).

It would only be removed though after the Python site setup file has been
executed. Ie., if I remember correctly, you can change the default setting
by adding a call to sys.setdefaultencoding() in the site file.

To me this suggests even more that adding a feature to mod_python which
would allow a directive in the Apache configuration file to set the
default encoding would be a good idea. As I said, it would have to be
at same scope as PythonImport and done when the interpreter is first
created but before any code is executed within the scope of the
interpreter.

If this was provided, it would certainly be a lot easier than having to
add something to the Python site file. Not even sure where that is.

Python FAQ has the following to say about this sort of stuff:

   It's possible to set a default encoding in a file called sitecustomize.py
   that's part of the Python library. However, this isn't recommended because
   changing the Python-wide default encoding may cause third-party extension
   modules to fail.

How much grief could we cause for third party modules by playing with it?
Remember that because it would be set by mod_python, it would only affect
stuff running under mod_python and wouldn't break anything else running on
the system.

Graham

Re: mod_python.publisher : proposal for a few implementation changes

Posted by Nicolas Lehuen <ni...@gmail.com>.

Wow. I'm working on another project right now which involves a C++
core with Python and Java mappings (using SWIG). I've just got
confused and assumed that the Java Native Interface behaviour of
exchanging string data in UTF8 format was also found in Python. Sorry.

So, all this relies on the default platform encoding. How nice. The
reason why you don't find sys.setdefaultencoding() is because this
method is deleted from the module after the module is loaded,
presumably to prevent developers from changing the default encoding on the
fly. I remember being mad at Python when I first discovered that (I
was trying to remove this dumb 'ascii' default encoding).

This is one more reason NOT to let the system handle the writing of
unicode strings on the request output stream. The server's default
encoding could be any encoding, and there is no guarantee that this
encoding is good for the content you want to send. My example about
French accented characters still holds ; quite simply, if I want to return
u'café' on my computer, I get this :

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
position 3: ordinal not in range(128)

It's not very useful to be able to return unicode strings if the only
codepoints that are allowed are those that have a mapping in ASCII...

So we might as well drop the Unicode support and tell the developer to
handle the encoding himself, OR extract the desired encoding from the
Content-Type header and handle the encoding in the publisher.
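
The second option can be sketched as follows, as an illustration rather than
the publisher's actual code. The function name encode_for_response and the
UTF-8 fallback are assumptions; the regular expression is the one from the
tentative code posted earlier in the thread, with re.I added. The sketch uses
Python 3 strings, so u'café' is written with an escape.

```python
import re

# Same pattern as the tentative code in the thread, made case-insensitive.
re_charset = re.compile(r"charset\s*=\s*([^\s;]+)", re.I)


def encode_for_response(text, content_type=None, default='utf-8'):
    """Encode text using the charset named in content_type, if any."""
    match = re_charset.search(content_type) if content_type else None
    charset = match.group(1) if match else default
    return text.encode(charset), charset


text = 'caf\u00e9'
print(encode_for_response(text, 'text/html; charset=iso-8859-1'))
print(encode_for_response(text, 'text/html'))

try:
    text.encode('ascii')
except UnicodeEncodeError as exc:
    # The failure described above: 'ascii' cannot represent U+00E9.
    print('ascii failed:', exc.reason)
```

A quoted parameter such as charset="utf-8" would be captured with its quotes
by this pattern, so real code would need to strip them.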

Regards,

Nicolas

On 4/30/05, Graham Dumpleton <gr...@dscpl.com.au> wrote:
> 
> On 30/04/2005, at 9:40 PM, Nicolas Lehuen wrote:
> 
> > Graham, the encoding used by PyArg_ParseTuple is indeed UTF-8, whereas
> > str(unicode_string) uses the default encoding of the platform Python
> > is running on, which is unpredictable (for example, for years now
> > under win32 it has been ASCII even though there are ways to get the
> > default encoding specific to the current setup ; I suspect the
> > situation is not better on other platforms).
> >
> > Thus, if we removed the check for UnicodeType and simply did result =
> > str(object) for unicode string, we would have runtime exceptions,
> > because if the string contains accents, under win32, the default
> > encoder (ascii) will complain that it does not know how to encode
> > them.
> >
> > I'd rather have the developer choose explicitly the encoding he
> > wishes to use, with a default to UTF8, through the content-type
> > header.
> 
> Hmmm, getting confusing. :-(
> 
> The code says:
> 
>      if (encoding == NULL)
>          encoding = PyUnicode_GetDefaultEncoding();
> 
>      /* Shortcuts for common default encodings */
>      if (errors == NULL) {
>          if (strcmp(encoding, "utf-8") == 0)
>              return PyUnicode_AsUTF8String(unicode);
>          else if (strcmp(encoding, "latin-1") == 0)
>              return PyUnicode_AsLatin1String(unicode);
> #if defined(MS_WINDOWS) && defined(HAVE_USABLE_WCHAR_T)
>          else if (strcmp(encoding, "mbcs") == 0)
>              return PyUnicode_AsMBCSString(unicode);
> #endif
>          else if (strcmp(encoding, "ascii") == 0)
>              return PyUnicode_AsASCIIString(unicode);
>      }
> 
>      /* Encode via the codec registry */
>      v = PyCodec_Encode(unicode, encoding, errors);
> 
> Thus default doesn't seem to be UTF-8 but is what ever the default
> encoding is as would be used by str().
> 
> Maybe mod_python should have an Apache configuration file option which
> allows you to set the default encoding. Internally it could call:
> 
>    PyUnicode_SetDefaultEncoding()
> 
> The option would only be able to be set outside of any <Directory> or
> other directives. Ie., same level as PythonImport. If the option is
> not set, mod_python could forcibly set it to something which makes
> more sense in a web environment and would cause less problems. For
> example, could set it to "UTF-8" if that works better.
> 
> Only thing I am not sure about is at what version of Python this
> function was introduced. Am a bit confused that my Python 2.3 on
> Mac OS X doesn't have sys.setdefaultencoding() yet in the Python 2.3.4
> source code I have, it is present. I presume that the underlying C
> function would still be there though.
> 
> Graham
> 
>

Re: mod_python.publisher : proposal for a few implementation changes

Posted by Graham Dumpleton <gr...@dscpl.com.au>.

On 30/04/2005, at 9:40 PM, Nicolas Lehuen wrote:

> Graham, the encoding used by PyArg_ParseTuple is indeed UTF-8, whereas
> str(unicode_string) uses the default encoding of the platform Python
> is running on, which is unpredictable (for example, for years now
> under win32 it has been ASCII even though there are ways to get the
> default encoding specific to the current setup ; I suspect the
> situation is not better on other platforms).
>
> Thus, if we removed the check for UnicodeType and simply did result =
> str(object) for unicode string, we would have runtime exceptions,
> because if the string contains accents, under win32, the default
> encoder (ascii) will complain that it does not know how to encode
> them.
>
> I'd rather have the developer choose explicitly the encoding he
> wishes to use, with a default to UTF8, through the content-type
> header.

Hmmm, getting confusing. :-(

The code says:

     if (encoding == NULL)
         encoding = PyUnicode_GetDefaultEncoding();

     /* Shortcuts for common default encodings */
     if (errors == NULL) {
         if (strcmp(encoding, "utf-8") == 0)
             return PyUnicode_AsUTF8String(unicode);
         else if (strcmp(encoding, "latin-1") == 0)
             return PyUnicode_AsLatin1String(unicode);
#if defined(MS_WINDOWS) && defined(HAVE_USABLE_WCHAR_T)
         else if (strcmp(encoding, "mbcs") == 0)
             return PyUnicode_AsMBCSString(unicode);
#endif
         else if (strcmp(encoding, "ascii") == 0)
             return PyUnicode_AsASCIIString(unicode);
     }

     /* Encode via the codec registry */
     v = PyCodec_Encode(unicode, encoding, errors);

Thus the default doesn't seem to be UTF-8 but is whatever the default
encoding is, as would be used by str().

Maybe mod_python should have an Apache configuration file option which
allows you to set the default encoding. Internally it could call:

   PyUnicode_SetDefaultEncoding()

The option would only be able to be set outside of any <Directory> or
other directives. Ie., same level as PythonImport. If the option is
not set, mod_python could forcibly set it to something which makes
more sense in a web environment and would cause less problems. For
example, could set it to "UTF-8" if that works better.

Only thing I am not sure about is at what version of Python this
function was introduced. I am a bit confused that my Python 2.3 on
Mac OS X doesn't have sys.setdefaultencoding(), yet it is present in the
Python 2.3.4 source code I have. I presume that the underlying C
function would still be there though.

Graham

Re: mod_python.publisher : proposal for a few implementation changes

Posted by Nicolas Lehuen <ni...@gmail.com>.

Graham, the encoding used by PyArg_ParseTuple is indeed UTF-8, whereas
str(unicode_string) uses the default encoding of the platform Python
is running on, which is unpredictable (for example, for years now
under win32 it has been ASCII even though there are ways to get the
default encoding specific to the current setup ; I suspect the
situation is not better on other platforms).

Thus, if we removed the check for UnicodeType and simply did result =
str(object) for unicode string, we would have runtime exceptions,
because if the string contains accents, under win32, the default
encoder (ascii) will complain that it does not know how to encode
them.

I'd rather have the developer choose explicitly the encoding he
wishes to use, with a default to UTF8, through the content-type
header.

Regards,
Nicolas

On 4/30/05, Graham Dumpleton <gr...@dscpl.com.au> wrote:
> 
> On 30/04/2005, at 6:37 PM, Nicolas Lehuen wrote:
> >         elif isinstance(object,UnicodeType):
> >             # TODO : this skips all encoding issues, which is VERY BAD
> >             # I don't even understand how the req.write below can work !
> >             result = object
> >         else:
> >             result = str(object)
> 
> What do you see is the issue that required an explicit check for
> UnicodeType
> and avoidance of converting it with str().
> 
> As the code is above, req.write() will be called with the
> UnicodeObject. This
> will work provided that the Unicode string can be converted into a
> normal
> string using the default encoding. Ie., in underlying C code
> PyArg_ParseTuple
> will use "s", meaning:
> 
> "s" (string or Unicode object) [char *]
>    Convert a Python string or Unicode object to a C pointer to a
> character
>    string. You must not provide storage for the string itself; a pointer
>    to an existing string is stored into the character pointer variable
>    whose address you pass. The C string is null-terminated. The Python
>    string must not contain embedded null bytes; if it does, a TypeError
>    exception is raised. Unicode objects are converted to C strings using
>    the default encoding. If this conversion fails, an UnicodeError is
> raised.
> 
> I think though that applying str() in the Python code to the Unicode
> string
> probably yields the same result. Ie., str(u'123') results in encode()
> method
> of Unicode string object being called.
> 
> S.encode([encoding[,errors]]) -> string
> 
> Return an encoded string version of S. Default encoding is the current
> default string encoding. errors may be given to set a different error
> handling scheme. Default is 'strict' meaning that encoding errors raise
> a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
> 'xmlcharrefreplace' as well as any other name registered with
> codecs.register_error that can handle UnicodeEncodeErrors.
> 
> In other words, I don't believe there is any difference between
> converting
> it using str() before the call to req.write() as there is passing
> Unicode
> string direct to req.write(). Thus, explicit check for UnicodeType
> probably
> not required.
> 
> Graham
> 
>

Re: mod_python.publisher : proposal for a few implementation changes

Posted by Graham Dumpleton <gr...@dscpl.com.au>.

On 30/04/2005, at 9:48 PM, Nicolas Lehuen wrote:
>
> Mmmh. It seems that __str__() should only return str instances. At
> least that's how I understand it from the fact that there is a
> __unicode__ special method in the object class (from the Python
> documentation) :
>
> __unicode__( self)
> Called to implement unicode() builtin; should return a Unicode object.
> When this method is not defined, string conversion is attempted, and
> the result of string conversion is converted to Unicode using the
> system default encoding.

This is just opening up a bigger can of worms. Where we currently have:

   result = str(object)

should it instead be:

   result = unicode(object)

If one was going to properly make the whole system Unicode capable, this
is probably what you would do. If an object doesn't define __unicode__()
it will call __str__() anyway and then convert that to a Unicode string.

At the moment, if something defines __unicode__() it gets completely
ignored.

Maybe this whole area of Unicode support should also be deferred to the
release after this one, although I might have a play with it in my
"vampire::publisher" code if I get a chance. :-)

Graham

Re: mod_python.publisher : proposal for a few implementation changes

Posted by Nicolas Lehuen <ni...@gmail.com>.

On 4/30/05, Graham Dumpleton <gr...@dscpl.com.au> wrote:
> You must be getting sick of me picking apart everything you put up. :-)

Well, no, it's exactly why I submit ideas to the list :)

> Problem I see with this is that it wouldn't be getting applied in all
> cases where Unicode strings would be getting returned. Imagine:
> 
>    class _Object:
>      def __str__(self):
>        return u'123'
> 
>    object = _Object()
> 
> The "object" variable isn't a Unicode string and if "object" is accessed,
> then str() gets applied to it and __str__() will return a Unicode string.
> This therefore bypasses your attempt to convert it using the appropriate
> encoding.

Mmmh. It seems that __str__() should only return str instances. At
least that's how I understand it from the fact that there is a
__unicode__ special method in the object class (from the Python
documentation) :

__unicode__( self) 
Called to implement unicode() builtin; should return a Unicode object.
When this method is not defined, string conversion is attempted, and
the result of string conversion is converted to Unicode using the
system default encoding.

That's a tricky one, I think we should ask the Python community about this.

> Interesting that in this case the Unicode string also gets delivered direct
> to req.write() as well.
> 
> I would suggest that the whole encoding issue be left up to the developer
> to handle rather than trying to be smart about it and make it automatic.
> The developer is going to know what they want, whereas we would be making
> assumptions and could get it wrong.
> 
> Graham

Why not, but we should forbid people from returning Unicode strings,
then, because they rely on an undocumented behaviour of the publisher
that could change later.

Regards,
Nicolas

Re: mod_python.publisher : proposal for a few implementation changes

Posted by Graham Dumpleton <gr...@dscpl.com.au>.

You must be getting sick of me picking apart everything you put up. :-)

Problem I see with this is that it wouldn't be getting applied in all
cases where Unicode strings would be getting returned. Imagine:

   class _Object:
     def __str__(self):
       return u'123'

   object = _Object()

The "object" variable isn't a Unicode string and if "object" is accessed,
then str() gets applied to it and __str__() will return a Unicode string.
This therefore bypasses your attempt to convert it using the appropriate
encoding.

Interesting that in this case the Unicode string also gets delivered direct
to req.write() as well.

I would suggest that the whole encoding issue be left up to the developer
to handle rather than trying to be smart about it and make it automatic.
The developer is going to know what they want, whereas we would be making
assumptions and could get it wrong.

Graham

On 30/04/2005, at 9:28 PM, Nicolas Lehuen wrote:

> I think in this case the default conversion used is UTF8. Ideally, a
> developer returning Unicode strings from functions should have a way
> to decide in what encoding (UTF-8, iso-latin-1, etc.) the string
> should be returned to the client.
>
> One possible way to do that would be to parse the content-type header,
> i.e. if the developer set the content type header to "text/html;
> charset=iso-8859-1", then we know the developer expect the result to
> be encoded in iso-8859-1, so we can do result =
> object.encode('iso-8859-1').
>
> Here is some tentative code for this :
>
> re_charset = re.compile(r"charset\s*=\s*([^\s;]+)")
>
> def publish_object(req, object):
>     if callable(object):
>         req.form = util.FieldStorage(req, keep_blank_values=1)
>         return publish_object(req,util.apply_fs_data(object, req.form, req=req))
>     elif hasattr(object,'__iter__'):
>         result = False
>         for item in object:
>             result |= publish_object(req,item)
>         return result
>     else:
>         if object is None:
>             return False
>         elif isinstance(object,UnicodeType):
>             # We try to detect the character encoding
>             # from the Content-Type header
>             if req._content_type_set:
>                 charset = re_charset.search(req.content_type)
>                 if charset:
>                     charset = charset.group(1)
>                 else:
>                     charset = 'UTF8'
>                     req.content_type += '; charset=UTF8'
>             else:
>                 charset = 'UTF8'
>
>             result = object.encode(charset)
>         else:
>             result = str(object)
>
>     [...]
>
> Regards,
> Nicolas
>
>
> On 4/30/05, Graham Dumpleton <gr...@dscpl.com.au> wrote:
>>
>> On 30/04/2005, at 6:37 PM, Nicolas Lehuen wrote:
>>>         elif isinstance(object,UnicodeType):
>>>             # TODO : this skips all encoding issues, which is VERY BAD
>>>             # I don't even understand how the req.write below can work !
>>>             result = object
>>>         else:
>>>             result = str(object)
>>
>> What do you see is the issue that required an explicit check for
>> UnicodeType
>> and avoidance of converting it with str().
>>
>> As the code is above, req.write() will be called with the
>> UnicodeObject. This
>> will work provided that the Unicode string can be converted into a
>> normal
>> string using the default encoding. Ie., in underlying C code
>> PyArg_ParseTuple
>> will use "s", meaning:
>>
>> "s" (string or Unicode object) [char *]
>>    Convert a Python string or Unicode object to a C pointer to a
>> character
>>    string. You must not provide storage for the string itself; a 
>> pointer
>>    to an existing string is stored into the character pointer variable
>>    whose address you pass. The C string is null-terminated. The Python
>>    string must not contain embedded null bytes; if it does, a 
>> TypeError
>>    exception is raised. Unicode objects are converted to C strings 
>> using
>>    the default encoding. If this conversion fails, an UnicodeError is
>> raised.
>>
>> I think though that applying str() in the Python code to the Unicode
>> string
>> probably yields the same result. Ie., str(u'123') results in encode()
>> method
>> of Unicode string object being called.
>>
>> S.encode([encoding[,errors]]) -> string
>>
>> Return an encoded string version of S. Default encoding is the current
>> default string encoding. errors may be given to set a different error
>> handling scheme. Default is 'strict' meaning that encoding errors 
>> raise
>> a UnicodeEncodeError. Other possible values are 'ignore', 'replace' 
>> and
>> 'xmlcharrefreplace' as well as any other name registered with
>> codecs.register_error that can handle UnicodeEncodeErrors.
>>
>> In other words, I don't believe there is any difference between
>> converting
>> it using str() before the call to req.write() as there is passing
>> Unicode
>> string direct to req.write(). Thus, explicit check for UnicodeType
>> probably
>> not required.
>>
>> Graham
>>
>>

Re: mod_python.publisher : proposal for a few implementation changes

Posted by Graham Dumpleton <gr...@dscpl.com.au>.

On 01/05/2005, at 9:07 PM, Nicolas Lehuen wrote:

> Graham, have you seen the standard inspect.getargspec() function ? It
> has existed since Python 2.1 and may save some code and help portability :
>
> http://www.python.org/doc/2.2.3/lib/inspect-classes-functions.html

Yes. I probably didn't use it out of habit. Specifically, when I wrote my
other Python project where I needed to interrogate arguments, I had to be
compatible with Python 1.5 and Python 2.0, so initially it didn't exist and
then I couldn't use it for compatibility reasons.

Possibly about time to start using it, although I am not sure it will change
much the amount of code that needs to be written, as you still need special
cases to drop off the self parameter of class instance methods. It also does
not eliminate the chain of if statements which identify the actual object
you need to apply the interrogation to. :-)
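
A rough sketch of that chain of if statements, assuming a current Python 3
interpreter, where inspect.getargspec() has been removed and
inspect.getfullargspec() is the closest equivalent. The helper name
expected_args is hypothetical, and only plain Python callables are handled.

```python
import inspect


def expected_args(obj):
    """Return the argument names a caller has to supply for obj."""
    if inspect.isclass(obj):
        init = obj.__init__
        if init is object.__init__:
            # The class defines no __init__ of its own: nothing to supply.
            return []
        return inspect.getfullargspec(init).args[1:]   # drop self
    if inspect.ismethod(obj):
        # getfullargspec still lists the bound first parameter.
        return inspect.getfullargspec(obj).args[1:]
    if inspect.isfunction(obj):
        return inspect.getfullargspec(obj).args
    if callable(obj):
        # Instance with __call__: inspect the function on its class.
        return inspect.getfullargspec(type(obj).__call__).args[1:]
    raise TypeError("%r is not callable" % (obj,))


class Page:
    def __init__(self, title):
        self.title = title

    def render(self, req):
        return self.title


def index(req, name='world'):
    return 'Hello, ' + name


print(expected_args(index))
print(expected_args(Page))
print(expected_args(Page('home').render))
```

The introspection call itself is one line per case; the branching to pick the
right object, and to drop self, is what remains.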

Graham

Re: mod_python.publisher : proposal for a few implementation changes

Posted by Nicolas Lehuen <ni...@gmail.com>.

Graham, have you seen the standard inspect.getargspec() function ? It
has existed since Python 2.1 and may save some code and help portability :

http://www.python.org/doc/2.2.3/lib/inspect-classes-functions.html

Regards,
Nicolas

On 5/1/05, Graham Dumpleton <gr...@dscpl.com.au> wrote:
> 
> On 01/05/2005, at 8:05 PM, Nicolas Lehuen wrote:
> 
> > Here is util.py with the modified line. I've tested it and it works.
> > The only problem is that it forces the developer to write a __init__
> > method, even though it only contains "pass".
> 
> You can work around the requirement to have "__init__()"; you just need to
> restructure the code a bit and move interrogation of co_flags into the
> actual if statement. I actually have a distinct routine for calculating
> args and other stuff. See code for it attached.
> 
>

Re: mod_python.publisher : proposal for a few implementation changes

Posted by Graham Dumpleton <gr...@dscpl.com.au>.

On 01/05/2005, at 8:05 PM, Nicolas Lehuen wrote:

> Here is util.py with the modified line. I've tested it and it works.
> The only problem is that it forces the developer to write a __init__
> method, even though it only contains "pass".

You can work around the requirement to have "__init__()"; you just need to
restructure the code a bit and move interrogation of co_flags into the
actual if statement. I actually have a distinct routine for calculating
args and other stuff. See code for it attached.
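
The attached routine is not reproduced in this archive. As a rough
illustration only, and assuming the co_flags test in question is the one that
detects a **kwargs parameter, the idea of filtering form fields against a
function's code object might look like this in Python 3. The helper name
call_with_fields is hypothetical.

```python
import inspect


def call_with_fields(func, fields):
    """Call func with only those form fields it is able to accept."""
    code = func.__code__
    if code.co_flags & inspect.CO_VARKEYWORDS:
        # func declares **kwargs, so every field can be passed through.
        accepted = dict(fields)
    else:
        # Otherwise keep only fields that match a declared argument name.
        names = code.co_varnames[:code.co_argcount]
        accepted = {k: v for k, v in fields.items() if k in names}
    return func(**accepted)


def strict(a, b=2):
    return (a, b)


def greedy(a, **extra):
    return (a, extra)


form = {'a': 1, 'c': 3}
print(call_with_fields(strict, form))
print(call_with_fields(greedy, form))
```

For classes, the same test would be applied to the code object of __init__
when one exists, which is where the special casing comes in.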

Re: mod_python.publisher : proposal for a few implementation changes

Posted by Nicolas Lehuen <ni...@gmail.com>.

Here is util.py with the modified line. I've tested it and it works.
The only problem is that it forces the developer to write a __init__
method, even though it only contains "pass".

I've modified the regular expression for the HTML closing tag in the
new branch. BTW, it was a bit scary to make this branch, at one point
I made a mistake and thought I had messed up the entire Apache
repository. Turns out I don't have this much power, fortunately ;).

If you want to check out this branch, here it is :

https://svn.apache.org/repos/asf/httpd/mod_python/branches/3.2.0-experimental-publisher

Regards,

Nicolas

On 5/1/05, Graham Dumpleton <gr...@dscpl.com.au> wrote:
> 
> On 01/05/2005, at 7:32 PM, Nicolas Lehuen wrote:
> > I just had to modify apply_fs_data in util.py to allow new-style
> > classes to be called (changing the test type(object) = ClassType to
> > type(object) in (TypeType, ClassType)), and to rewrite publisher.py as
> > enclosed.
> 
> Can you post your modified apply_fs_data. Want to see if what you came
> up with is similar to mine or whether one or the other of us forgot some
> strange case.
> 
> In respect of:
> 
>    re_html = re.compile(r"</HTML",re.I)
> 
> Was thinking that maybe a better pattern might be:
> 
>    re_html = re.compile(r"</HTML\s*>\s*$",re.I)
> 
> Ie., explicitly require that </html> is the very last thing in the
> content.
> 
> If I am right that nothing can come after </html> this would be the most
> accurate thing to use and would save wrongly calling something HTML
> when it is not, like this message would be if I stick </html> here. :-)
> 
> Graham
> 
>

Re: mod_python.publisher : proposal for a few implementation changes

Posted by Graham Dumpleton <gr...@dscpl.com.au>.

On 01/05/2005, at 7:32 PM, Nicolas Lehuen wrote:
> I just had to modify apply_fs_data in util.py to allow new-style
> classes to be called (changing the test type(object) = ClassType to
> type(object) in (TypeType, ClassType)), and to rewrite publisher.py as
> enclosed.

Can you post your modified apply_fs_data. Want to see if what you came
up with is similar to mine or whether one or the other of us forgot some
strange case.

In respect of:

   re_html = re.compile(r"</HTML",re.I)

Was thinking that maybe a better pattern might be:

   re_html = re.compile(r"</HTML\s*>\s*$",re.I)

Ie., explicitly require that </html> is the very last thing in the content.

If I am right that nothing can come after </html> this would be the most
accurate thing to use and would save wrongly calling something HTML
when it is not, like this message would be if I stick </html> here. :-)
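
The difference between the two patterns can be checked with a small
experiment. Both regular expressions are the ones quoted above; the helper
is_html and the sample strings are made up for illustration.

```python
import re

re_loose = re.compile(r"</HTML", re.I)
re_strict = re.compile(r"</HTML\s*>\s*$", re.I)


def is_html(content, pattern):
    """Return True if pattern matches somewhere in content."""
    return pattern.search(content) is not None


page = "<html><body>Hello</body></html>\n"
mail = "This message mentions </html> in passing and then carries on."

for name, text in (("page", page), ("mail", mail)):
    print(name, is_html(text, re_loose), is_html(text, re_strict))
```

The loose pattern accepts both samples, while the anchored one accepts only
the sample that actually ends with the closing tag.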

Graham

Re: mod_python.publisher : proposal for a few implementation changes

Posted by Nicolas Lehuen <ni...@gmail.com>.

OK, I now have a proper implementation of my proposals, working with
iterables, generators, new-style classes and old-style classes.
Generators are especially fun, they allow you to write things like :

def index(req):
    yield '<html><body>'
    yield greetings(req,'Nicolas')
    yield '</body></html>'

def greetings(req,first_name):
    yield '<p>Hello, '
    yield first_name
    yield '!</p>'
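
As a rough illustration of why this works, nested generators can be
flattened by recursing into anything iterable. The sketch below is written
for present-day Python 3 and is not the publisher code itself; the helper
name flatten is hypothetical, and it collects text pieces instead of
calling req.write.

```python
def index(req):
    yield '<html><body>'
    yield greetings(req, 'Nicolas')
    yield '</body></html>'


def greetings(req, first_name):
    yield '<p>Hello, '
    yield first_name
    yield '!</p>'


def flatten(obj):
    """Hypothetical helper: walk nested iterables, yielding text in order."""
    if obj is None:
        return
    if isinstance(obj, str):
        # Strings are iterable too, so handle them before the generic case.
        yield obj
    elif hasattr(obj, '__iter__'):
        for item in obj:
            # Recurse so that a generator yielded by a generator is expanded.
            yield from flatten(item)
    else:
        yield str(obj)


# req is unused by these two functions, so None stands in for it here.
html = ''.join(flatten(index(None)))
```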

I just had to modify apply_fs_data in util.py to allow new-style
classes to be called (changing the test type(object) == ClassType to
type(object) in (TypeType, ClassType)), and to rewrite publisher.py as
enclosed.

All unit tests run OK. The only problem, like you wrote, Graham, is
that we have to change the traversal & publishing rules to allow
old-style classes and new-style classes publishing. This may induce
some problems if some developers thought their classes were safe from
publishing. Ah, well, I guess we could only allow this for power
users, I'll have a look at how we could add a PythonOption for this.

For now, I've kept the current traversal & publishing rules, so this
should not break anything. But as we are trying to make a 3.2 release,
I'm not going to put this on the trunk. I'm building a branch named
"experimental-publisher", and we'll merge it after the 3.2 release.

Regards,
Nicolas

Re: mod_python.publisher : proposal for a few implementation changes

Posted by Nicolas Lehuen <ni...@gmail.com>.

Well, the Content-Encoding header is for HTTP content encoding, as
explained in the RFC :

http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html

For example, the server can gzip the result page and set
Content-Encoding to gzip. This is not related to the character
encoding, which is given as the charset parameter of the Content-Type
header.
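
A small sketch of the distinction, in present-day Python 3 with only the
standard library. The sample text and the plain headers dictionary are
assumptions for illustration; this is not mod_python code.

```python
import gzip

text = "caf\u00e9"
charset = "iso-8859-1"

# Character encoding: turn text into bytes (advertised in Content-Type).
encoded = text.encode(charset)

# Content encoding: transform those bytes for transfer
# (advertised in Content-Encoding).
body = gzip.compress(encoded)

headers = {
    "Content-Type": "text/plain; charset=" + charset,
    "Content-Encoding": "gzip",
}

# A client undoes the two steps in reverse order.
decoded = gzip.decompress(body).decode(charset)
```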

Regards,
Nicolas

On 4/30/05, Graham Dumpleton <gr...@dscpl.com.au> wrote:
> Not to start this up again, but just for the record, I wanted to add that
> there is the req.content_encoding member. Not sure how this comes into
> the picture at all. Something to consider later. :-)
> 
> Graham

Re: mod_python.publisher : proposal for a few implementation changes

Posted by Graham Dumpleton <gr...@dscpl.com.au>.

Not to start this up again, but just for the record, I wanted to add that
there is the req.content_encoding member. Not sure how this comes into
the picture at all. Something to consider later. :-)

Graham

On 30/04/2005, at 9:35 PM, Nicolas Lehuen wrote:

> [Woops, forgot to put the list in the recipients.]
>
> I think in this case the default conversion used is UTF8. Ideally, a
> developer returning Unicode strings from functions should have a way
> to decide in what encoding (UTF-8, iso-latin-1, etc.) the string
> should be returned to the client.
>
> One possible way to do that would be to parse the content-type header,
> i.e. if the developer set the content type header to "text/html;
> charset=iso-8859-1", then we know the developer expects the result to
> be encoded in iso-8859-1, so we can do result =
> object.encode('iso-8859-1').
>
> Here is some tentative code for this :
>
> re_charset = re.compile(r"charset\s*=\s*([^\s;]+)")
>
> def publish_object(req, object):
>     if callable(object):
>         req.form = util.FieldStorage(req, keep_blank_values=1)
>         return publish_object(
>             req, util.apply_fs_data(object, req.form, req=req))
>     elif hasattr(object,'__iter__'):
>         result = False
>         for item in object:
>             result |= publish_object(req,item)
>         return result
>     else:
>         if object is None:
>             return False
>         elif isinstance(object,UnicodeType):
>             # We try to detect the character encoding
>             # from the Content-Type header
>             if req._content_type_set:
>                 charset = re_charset.search(req.content_type)
>                 if charset:
>                     charset = charset.group(1)
>                 else:
>                     charset = 'UTF8'
>                     req.content_type += '; charset=UTF8'
>             else:
>                 charset = 'UTF8'
>
>             result = object.encode(charset)
>         else:
>             result = str(object)
>
>     [...]
>
> Regards,
> Nicolas
>
> On 4/30/05, Graham Dumpleton <gr...@dscpl.com.au> wrote:
>>
>> On 30/04/2005, at 6:37 PM, Nicolas Lehuen wrote:
>>>         elif isinstance(object,UnicodeType):
>>>             # TODO : this skips all encoding issues, which is VERY BAD
>>>             # I don't even understand how the req.write below can work !
>>>             result = object
>>>         else:
>>>             result = str(object)
>>
>> What do you see as the issue that required an explicit check for
>> UnicodeType and avoidance of converting it with str()?
>>
>> As the code is above, req.write() will be called with the Unicode
>> object. This will work provided that the Unicode string can be
>> converted into a normal string using the default encoding. I.e., in the
>> underlying C code PyArg_ParseTuple will use "s", meaning:
>>
>> "s" (string or Unicode object) [char *]
>>    Convert a Python string or Unicode object to a C pointer to a
>>    character string. You must not provide storage for the string
>>    itself; a pointer to an existing string is stored into the character
>>    pointer variable whose address you pass. The C string is
>>    null-terminated. The Python string must not contain embedded null
>>    bytes; if it does, a TypeError exception is raised. Unicode objects
>>    are converted to C strings using the default encoding. If this
>>    conversion fails, an UnicodeError is raised.
>>
>> I think though that applying str() in the Python code to the Unicode
>> string probably yields the same result. I.e., str(u'123') results in
>> the encode() method of the Unicode string object being called.
>>
>> S.encode([encoding[,errors]]) -> string
>>
>> Return an encoded string version of S. Default encoding is the current
>> default string encoding. errors may be given to set a different error
>> handling scheme. Default is 'strict' meaning that encoding errors
>> raise a UnicodeEncodeError. Other possible values are 'ignore',
>> 'replace' and 'xmlcharrefreplace' as well as any other name registered
>> with codecs.register_error that can handle UnicodeEncodeErrors.
>>
>> In other words, I don't believe there is any difference between
>> converting it using str() before the call to req.write() and passing
>> the Unicode string directly to req.write(). Thus, an explicit check for
>> UnicodeType is probably not required.
>>
>> Graham
>>
>>
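
The charset detection proposed in the quoted message can be sketched on its
own in present-day Python 3, where text is encoded to bytes explicitly. The
regular expression is the one from the proposal; the function name
encode_for_response and its parameters are hypothetical, and a plain string
stands in for req.content_type.

```python
import re

# Pattern taken from the quoted proposal.
re_charset = re.compile(r"charset\s*=\s*([^\s;]+)")


def encode_for_response(text, content_type=None, default_charset="UTF-8"):
    """Hypothetical helper: return (body_bytes, content_type).

    If a content type is given and names a charset, that charset is used.
    If it is given without a charset, the default is used and appended.
    If no content type is given, the default is used and None is returned.
    """
    if content_type is not None:
        match = re_charset.search(content_type)
        if match:
            charset = match.group(1)
        else:
            charset = default_charset
            content_type += "; charset=" + default_charset
    else:
        charset = default_charset
    return text.encode(charset), content_type


latin1 = encode_for_response("caf\u00e9", "text/html; charset=iso-8859-1")
utf8 = encode_for_response("caf\u00e9", "text/html")
```

An unknown charset name makes encode() raise LookupError, and a character
that the chosen charset cannot represent raises UnicodeEncodeError, so a
real handler would need to decide what to do in both cases.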