Posted to dev@httpd.apache.org by Sam Ruby <ru...@apache.org> on 2005/10/12 05:15:54 UTC

mod_mbox and atom 1.0

I note that mod_mbox now produces Atom 1.0 feeds.  Excellent!

 = = =

There is a feedvalidator that can be used to identify areas of improvement.

  http://www.feedvalidator.org/

The highest priority is to make sure that the encoding is correct.  As
it currently stands, many of these feeds are not well formed XML,
meaning that they will be rejected by conformant XML parsers.  Fixing
this will improve the usability of the HTML pages.

An outline of what needs to be done can be found here:

  http://intertwingly.net/stories/2005/09/28/xchar.rb

This is in Ruby.  I can translate to C any portions you may have
questions on.

 = = =

There also is a minor issue regarding canonicalization.  Also, email
addresses should be split out from the name (I'll go fix the
feedvalidator to issue warnings on this).
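
As a rough illustration of splitting the address out from the name, here is a minimal Python sketch (not mod_mbox code, which is C). It assumes the value comes from a mail From: header and that the target is an Atom author element with separate name and email children; the function name atom_author is made up for this example.

```python
from email.utils import parseaddr
from xml.sax.saxutils import escape


def atom_author(from_header):
    """Build an Atom <author> element from a raw From: header value.

    Hypothetical helper: the display name and the address are emitted as
    separate child elements instead of one combined string.
    """
    name, address = parseaddr(from_header)
    # Fall back to the address when the header carries no display name.
    lines = ["<author>", "  <name>%s</name>" % escape(name or address)]
    if address:
        lines.append("  <email>%s</email>" % escape(address))
    lines.append("</author>")
    return "\n".join(lines)


print(atom_author("Sam Ruby <rubys@example.org>"))
```

The example address is a placeholder. Header values that use encoded words would need decoding before this step.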

Finally, it is clear that the authors of mod_mbox know a thing or two
about CSS.  Such techniques can also be applied to feeds.  Take a look
at mine for an example, which I am sure you can improve on:

http://intertwingly.net/blog/index.atom

- Sam Ruby

Re: mod_mbox and atom 1.0

Posted by Sam Ruby <ru...@apache.org>.
Nick Kew wrote:
> On Wednesday 12 October 2005 04:31, Paul Querna wrote:
> 
>>>An outline of what needs to be done can be found here:
>>>
>>>  http://intertwingly.net/stories/2005/09/28/xchar.rb
> 
> Erm, no.  We need to reencode from any incoming charset.
> We don't need to reinvent any wheels by recreating individual
> charset conversion tables.

There are two special cases that merit consideration.

If, *after* you convert to Unicode, you end up with

1) Characters that are outside the valid range for XML, then
   they must be replaced.  The valid range is:

      0x9, 0xA, 0xD,
      (0x20..0xD7FF),
      (0xE000..0xFFFD),
      (0x10000..0x10FFFF)

   The most common character that causes such a problem is
   a form-feed character, common in RFCs for example.

2) Characters in the range of (0x80..0x9F) are either reserved or
   are control characters.  27 of these characters were "embraced
   and extended" by our friends in Redmond.  That's the single
   table that you so viscerally reacted to.

   The most common characters that cause such problems are
   the so-called smart-quotes.
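
The two cases above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a translation of xchar.rb: it assumes the text has already been converted to Unicode, it uses Python's cp1252 codec as the table for the 0x80..0x9F range, and it picks U+FFFD as the replacement character. The name sanitize_for_xml is hypothetical.

```python
import re

# Code points 0x80..0x9F, which are C1 controls in Unicode.
_C1 = re.compile(r"[\x80-\x9F]")

# Anything outside the valid XML 1.0 character ranges listed above.
_INVALID_XML = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def _cp1252_fixup(match):
    """Reinterpret one C1 code point as the Windows-1252 byte of that value."""
    char = match.group()
    try:
        return bytes([ord(char)]).decode("cp1252")
    except UnicodeDecodeError:
        # Five byte values are undefined in cp1252; leave those unchanged.
        return char


def sanitize_for_xml(text, replacement="\uFFFD"):
    """Map 0x80..0x9F through cp1252, then replace invalid XML characters."""
    text = _C1.sub(_cp1252_fixup, text)
    return _INVALID_XML.sub(replacement, text)


print(sanitize_for_xml("\x93quoted\x94 text with a form feed\x0c"))
```

Whether to substitute a visible character or simply drop the offending one is a design choice; the replacement parameter leaves that open.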

>>Right now mod_mbox does *no* encoding translation.  We really need to be
>>calling apr_xlate all over, and turning everything into UTF-8 First.
>>Currently, each item is encoded in whatever the client program sent it
>>as... which isn't good. 
> 
> Even the HTML is erroneously sent as iso-8859-1, so posts that arrive as
> utf-8 (eg from wrowe) display incorrectly!  As of now it's not really fit for 
> purpose.  We should fix this for the benefit of all display formats, rather
> than address html, atom, or indeed anything else in isolation.

One possibility is to convert characters above 0xFF to numeric character
references, like &#x2019;.  Even though it is wrong to do so, people
often consume feeds with regular expressions, "aggregate" bits from
various places using the equivalent of strcat, and toss the results into
a web page, leaving the default as iso-8859-1.  Numeric character
references have the benefit of meaning the same thing independent of
whether the bytes are interpreted as iso-8859-1, utf-8, or even us-ascii.
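
A minimal Python sketch of that conversion, for illustration only. The function name to_ncr and the limit parameter are assumptions made for this example; the default follows the 0xFF threshold mentioned above, and passing 0x7F instead yields pure ASCII output.

```python
def to_ncr(text, limit=0xFF):
    """Replace every character above `limit` with a hexadecimal
    numeric character reference such as &#x2019;."""
    return "".join(
        ch if ord(ch) <= limit else "&#x%X;" % ord(ch)
        for ch in text
    )


print(to_ncr("it\u2019s"))             # the apostrophe becomes a reference
print(to_ncr("caf\u00e9", limit=0x7F)) # stricter limit: ASCII-only output
```

This only handles the character references; markup characters such as & and < still need the usual XML escaping, which should be applied before this step so that the references themselves are not escaped. Python's built-in text.encode("ascii", "xmlcharrefreplace") does much the same with decimal references.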

> Regarding the mail archives, the ideal solution would be to transcode
> incoming messages to a homogenous utf-8 before storing them.  To make
> that useful, we'd need to transcode the existing archives too, though that
> would just be a one-off script.  I see a mod_smtpd filter thrashing around
> that to-do list ...  dammit, it's the long-awaited updates to charset_lite!

Just mentioning in passing: if you have a message of uncertain encoding,
there is a regular expression that can be used to determine if it is
likely in utf-8 already.  Given the design of utf-8, false positives are
rare, and the chances drop as the length of the message increases.
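
One way such a check might look in Python, as a sketch. The pattern below is an assumption about what is meant: it matches only well-formed UTF-8 byte sequences (no overlong forms, no surrogates, nothing above U+10FFFF). The name looks_like_utf8 is hypothetical.

```python
import re

# Matches a byte string made up entirely of well-formed UTF-8 sequences.
_UTF8 = re.compile(rb"""
    (?:
        [\x00-\x7F]                        # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # two-byte sequences
      | \xE0[\xA0-\xBF][\x80-\xBF]         # three bytes, no overlong forms
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # three bytes, general case
      | \xED[\x80-\x9F][\x80-\xBF]         # three bytes, no surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # four bytes, planes 1 to 3
      | [\xF1-\xF3][\x80-\xBF]{3}          # four bytes, planes 4 to 15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # four bytes, plane 16
    )*
""", re.VERBOSE)


def looks_like_utf8(data):
    """Return True if the whole byte string is well-formed UTF-8."""
    return _UTF8.fullmatch(data) is not None


print(looks_like_utf8("smart \u2019 quote".encode("utf-8")))
print(looks_like_utf8(b"smart \x92 quote"))
```

Pure ASCII input also passes, since ASCII is a subset of UTF-8, so a positive result is most informative when the message contains at least one multi-byte sequence.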

> The harder bit to deal with is _local_ encoding in different charsets in
> header lines.  That's a PITA, and is AFAIK peculiar to SMTP.

- Sam Ruby

Re: mod_mbox and atom 1.0

Posted by Nick Kew <ni...@webthing.com>.
On Wednesday 12 October 2005 04:31, Paul Querna wrote:

> > An outline of what needs to be done can be found here:
> > 
> >   http://intertwingly.net/stories/2005/09/28/xchar.rb

Erm, no.  We need to reencode from any incoming charset.
We don't need to reinvent any wheels by recreating individual
charset conversion tables.

> Right now mod_mbox does *no* encoding translation.  We really need to be
> calling apr_xlate all over, and turning everything into UTF-8 First.
> Currently, each item is encoded in whatever the client program sent it
> as... which isn't good.

Even the HTML is erroneously sent as iso-8859-1, so posts that arrive as
utf-8 (eg from wrowe) display incorrectly!  As of now it's not really fit for 
purpose.  We should fix this for the benefit of all display formats, rather
than address html, atom, or indeed anything else in isolation.

Regarding the mail archives, the ideal solution would be to transcode
incoming messages to a homogenous utf-8 before storing them.  To make
that useful, we'd need to transcode the existing archives too, though that
would just be a one-off script.  I see a mod_smtpd filter thrashing around
that to-do list ...  dammit, it's the long-awaited updates to charset_lite!

The harder bit to deal with is _local_ encoding in different charsets in
header lines.  That's a PITA, and is AFAIK peculiar to SMTP.
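
For illustration, a small Python sketch of decoding such header lines, assuming the per-header encoding in question is the RFC 2047 encoded-word form (=?charset?q?...?=). The function name decode_mail_header and the iso-8859-1 fallback are assumptions made for this example.

```python
from email.header import decode_header


def decode_mail_header(raw, fallback="iso-8859-1"):
    """Decode a header value that may contain RFC 2047 encoded words,
    each of which can name a different charset."""
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            try:
                chunk = chunk.decode(charset or "ascii")
            except (LookupError, UnicodeDecodeError):
                # Unknown or mislabelled charset: fall back, never fail.
                chunk = chunk.decode(fallback, errors="replace")
        parts.append(chunk)
    return "".join(parts)


print(decode_mail_header("=?iso-8859-1?q?p=F6stal?="))
```

The result is a Unicode string, which can then be written out as UTF-8 like the rest of the message.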

-- 
Nick Kew

Re: mod_mbox and atom 1.0

Posted by Paul Querna <ch...@force-elite.com>.
Sam Ruby wrote:
> The highest priority is to make sure that the encoding is correct.  As
> it currently stands, many of these feeds are not well formed XML,
> meaning that they will be rejected by conformant XML parsers.  Fixing
> this will improve the usability of the HTML pages.
> 
> An outline of what needs to be done can be found here:
> 
>   http://intertwingly.net/stories/2005/09/28/xchar.rb
> 
> This is in Ruby.  I can translate to C any portions you may have
> questions on.

Well, this is actually a small part of the whole encoding problem.

Right now mod_mbox does *no* encoding translation.  We really need to be 
calling apr_xlate all over, and turning everything into UTF-8 First. 
Currently, each item is encoded in whatever the client program sent it 
as... which isn't good.
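
To make the "everything into UTF-8 first" step concrete, here is a minimal Python sketch. It is not mod_mbox code and does not use apr_xlate; it only illustrates the idea, assuming each item carries a declared charset that may be missing or wrong. The function name to_utf8 and the iso-8859-1 fallback are assumptions.

```python
def to_utf8(raw, declared_charset=None, fallback="iso-8859-1"):
    """Transcode raw message bytes to UTF-8.

    Tries the declared charset first; if it is missing, unknown, or does
    not fit the bytes, decodes with the fallback charset instead.
    """
    try:
        text = raw.decode(declared_charset or fallback)
    except (LookupError, UnicodeDecodeError):
        text = raw.decode(fallback, errors="replace")
    return text.encode("utf-8")


print(to_utf8(b"caf\xe9", "iso-8859-1"))
```

Doing this once, before any HTML or Atom output is generated, means every output format can then declare UTF-8 consistently.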

> There also is a minor issue regarding canonicalization.  Also, email
> addresses should be split out from the name (I'll go fix the
> feedvalidator to issue warnings on this).

Yep, I saw that part of the spec, but I was just being lazy when I wrote 
the atom stuff.