Posted to dev@httpd.apache.org by Paul Querna <ch...@force-elite.com> on 2006/07/03 17:18:54 UTC

Re: atom feeds for projects

robert burrell donkin wrote:
> On 7/2/06, Sam Ruby <ru...@apache.org> wrote:
>>
>> robert burrell donkin wrote:
>> > the mailing list archives at apache run on mod_mbox which also supplies
>> > atom feeds for these lists. i've added the feed from general to the
>> > front page and think it'd be cool to add feeds to the pages in projects
>> > as well. since the focus of podlings should be recruiting developers
>> > (not users) i'm thinking of adding feeds to the dev lists.
>> >
>> > opinions?
>> >
>> > volunteers?
>>
>> Just be aware that the feeds produced are rarely well formed XML, mostly
>> due to encoding issues.  For example: http://tinyurl.com/h5f7t
>>
>> I tried to submit a patch based on my limited understanding of the code,
>> and was told that my patch wasn't acceptable

To be clear, AFAIK, there was never a patch for mod_mbox -- it was a
Ruby file that only solved part of the problem. Again, AFAIK, no one
ever wrote a patch in C for mod_mbox to attempt to resolve this issue.

>> and that XML parsers that
>> require well-formedness were broken anyway -- despite that being
>> explicitly what the spec requires.

It's unfortunate that the discussion degraded into that.

>> I'd be willing to try again, but only if there was active interest in
>> actually fixing the problem.

Yes, there is active interest in making mod_mbox better.

> IMO we should fix the feed but i'm not involved with mod_mbox (or httpd).
> anyone who is want to jump in here?

The primary bug is lack of encoding support.  mod_mbox just doesn't even
try to do it.

Someone needs to write something that touches many parts of the code,
using the apr_xlate API to convert the content to utf-8.  (This would
also help it validate as HTML).  Once that is done, we do need to worry
about out of range characters, some of which would be removed, others
possibly HTML encoded.

For future discussion of this please use dev@httpd.

Thanks,

-Paul


Re: atom feeds for projects

Posted by Paul Querna <ch...@force-elite.com>.
Sam Ruby wrote:
> And the impact is dramatic.  IE7 won't display any feed that is not well
> formed.  FireFox 2 will stop at the first error.  Bloglines (I'm told,
> but haven't verified) will fall back to a rather sub-optimal RSS parser
> to handle broken Atom feeds - and the results aren't pretty.  Suffice it
> to say (and I say this primarily for Paul's benefit) - I believe that
> either this code, or code that performs a similar function - will
> provide an immediate improvement to Bloglines users who subscribe to
> mod_mbox produced feeds.
> 

Sam,

I don't feel that it is appropriate to discuss my day job on this list,
or in this context.

I don't work on mod_mbox for my job. I have worked on it in the past
because I wanted something that is usable for my personal life as a
developer at the ASF.

Thanks,

-Paul

Re: atom feeds for projects

Posted by Sam Ruby <ru...@apache.org>.
Garrett Rooney wrote:
> On 7/4/06, Sam Ruby <ru...@apache.org> wrote:
>> Garrett Rooney wrote:
>> > On 7/4/06, Sam Ruby <ru...@apache.org> wrote:
>> >
>> >> > To be clear, AFAIK, there was never a patch for mod_mbox -- it was a
>> >> > Ruby file that only solved part of the problem. Again, AFAIK, no one
>> >> > ever wrote a patch in C for mod_mbox to attempt to resolve this issue.
>> >>
>> >> I offered.  The response was, and I quote, "Erm, no".
>> >
>> > The "Erm, no" was in response to the approach, not the offer to help, IIRC.
>> >
>> > If you're willing to fix the problem the right way, by adding real
>> > support for character sets to mod_mbox, I'm sure nobody would have a
>> > problem with  that.
>>
>> You chose to snip the portion where I argue that the approach I outlined
>> is necessary, at least as a fall-back/safety net.  Care to explain why
>> such a fall-back/safety net isn't necessary or appropriate?
> 
> No argument that it's necessary, but it seems kind of pointless to fix
> that part without fixing the underlying fact that mod_mbox is totally
> ignorant of character sets.  You'll get perfectly "valid" junk in the
> vast majority of cases, that doesn't seem like a real step forward to
> me.

"vast majority"?  I beg to differ.

In any case, the current code assumes that everything is valid utf-8.
And there has been no indication of that assumption changing since I
posted my offer last October.

For e-mail messages that are either correct UTF-8 or US-ASCII, the 
current code just works.  That's a substantial portion of messages.

With the code I posted, the majority of the messages which are 
iso-8859-1 will be converted to utf-8.  Even if they don't contain the 
proper charset headers.  And if they happen to be "salted" with win-1252 
characters like "smart quotes", those will be corrected too.
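
To make the fallback concrete, here is a minimal sketch in C.  It is
not the code from the post linked in this thread; the function name
win1252_to_utf8 and the choice to emit U+FFFD for the five bytes that
win-1252 leaves undefined are assumptions made for illustration.  It
treats every byte as iso-8859-1, except that 0x80 through 0x9F are
mapped through the win-1252 table, which is where "smart quotes" live.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Windows-1252 code points for bytes 0x80..0x9F; 0 marks the five
 * bytes that win-1252 leaves undefined. */
static const unsigned short cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Encode a code point below 0x10000 as UTF-8; returns bytes written. */
static size_t put_utf8(unsigned int cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Hypothetical helper: convert a NUL-terminated iso-8859-1 / win-1252
 * string to a newly malloc'ed UTF-8 string.  Caller frees the result. */
char *win1252_to_utf8(const char *in)
{
    assert(in != NULL);
    /* Each input byte expands to at most three output bytes. */
    char *out = malloc(strlen(in) * 3 + 1);
    char *p = out;
    if (out == NULL)
        return NULL;
    for (const unsigned char *s = (const unsigned char *)in; *s; s++) {
        unsigned int cp = *s;
        if (cp >= 0x80 && cp <= 0x9F) {
            cp = cp1252_high[cp - 0x80];
            if (cp == 0)
                cp = 0xFFFD;    /* undefined in win-1252 */
        }
        p += put_utf8(cp, p);
    }
    *p = '\0';
    return out;
}
```

A caller would apply this only to input that fails a UTF-8 validity
check, so that messages which are already correct utf-8 pass through
untouched.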

I wager that this covers a substantial portion of the non-spam messages 
in your in-box.

And the impact is dramatic.  IE7 won't display any feed that is not well 
formed.  FireFox 2 will stop at the first error.  Bloglines (I'm told, 
but haven't verified) will fall back to a rather sub-optimal RSS parser 
to handle broken Atom feeds - and the results aren't pretty.  Suffice it 
to say (and I say this primarily for Paul's benefit) - I believe that 
either this code, or code that performs a similar function - will 
provide an immediate improvement to Bloglines users who subscribe to 
mod_mbox produced feeds.

As to handling the charset correctly, this can proceed incrementally.
Parsing the header isn't all that hard.  Fixing the body given the
charset should be only one call.  Expanding this to the subject and from
headers (presuming that they, too, are covered by the charset; I haven't
checked what the specs and/or common practice indicate in this matter)
can be done at leisure.

I'm willing to help there too.  But I have seen too many emails and too 
many tools that are broken when it comes to encoding to want to invest 
the time in learning how to build and deploy a test version of mod_mbox 
as long as the prevailing mood of the project can be summed up with 
"Erm, no".

- Sam Ruby

Re: atom feeds for projects

Posted by Garrett Rooney <ro...@electricjellyfish.net>.
On 7/4/06, Sam Ruby <ru...@apache.org> wrote:
> Garrett Rooney wrote:
> > On 7/4/06, Sam Ruby <ru...@apache.org> wrote:
> >
> >> > To be clear, AFAIK, there was never a patch for mod_mbox -- it was a
> >> > Ruby file that only solved part of the problem. Again, AFAIK, no one
> >> > ever wrote a patch in C for mod_mbox to attempt to resolve this issue.
> >>
> >> I offered.  The response was, and I quote, "Erm, no".
> >
> > The "Erm, no" was in response to the approach, not the offer to help, IIRC.
> >
> > If you're willing to fix the problem the right way, by adding real
> > support for character sets to mod_mbox, I'm sure nobody would have a
> > problem with  that.
>
> You chose to snip the portion where I argue that the approach I outlined
> is necessary, at least as a fall-back/safety net.  Care to explain why
> such a fall-back/safety net isn't necessary or appropriate?

No argument that it's necessary, but it seems kind of pointless to fix
that part without fixing the underlying fact that mod_mbox is totally
ignorant of character sets.  You'll get perfectly "valid" junk in the
vast majority of cases, that doesn't seem like a real step forward to
me.

-garrett

Re: atom feeds for projects

Posted by Sam Ruby <ru...@apache.org>.
Garrett Rooney wrote:
> On 7/4/06, Sam Ruby <ru...@apache.org> wrote:
> 
>> > To be clear, AFAIK, there was never a patch for mod_mbox -- it was a
>> > Ruby file that only solved part of the problem. Again, AFAIK, no one
>> > ever wrote a patch in C for mod_mbox to attempt to resolve this issue.
>>
>> I offered.  The response was, and I quote, "Erm, no".
> 
> The "Erm, no" was in response to the approach, not the offer to help, IIRC.
> 
> If you're willing to fix the problem the right way, by adding real
> support for character sets to mod_mbox, I'm sure nobody would have a
> problem with  that.

You chose to snip the portion where I argue that the approach I outlined 
is necessary, at least as a fall-back/safety net.  Care to explain why 
such a fall-back/safety net isn't necessary or appropriate?

- Sam Ruby

Re: atom feeds for projects

Posted by Garrett Rooney <ro...@electricjellyfish.net>.
On 7/4/06, Sam Ruby <ru...@apache.org> wrote:

> > To be clear, AFAIK, there was never a patch for mod_mbox -- it was a
> > Ruby file that only solved part of the problem. Again, AFAIK, no one
> > ever wrote a patch in C for mod_mbox to attempt to resolve this issue.
>
> I offered.  The response was, and I quote, "Erm, no".

The "Erm, no" was in response to the approach, not the offer to help, IIRC.

If you're willing to fix the problem the right way, by adding real
support for character sets to mod_mbox, I'm sure nobody would have a
problem with  that.

-garrett

Re: atom feeds for projects

Posted by Sam Ruby <ru...@apache.org>.
Paul Querna wrote:
> robert burrell donkin wrote:
>> On 7/2/06, Sam Ruby <ru...@apache.org> wrote:
>>> robert burrell donkin wrote:
>>>> the mailing list archives at apache run on mod_mbox which also supplies
>>>> atom feeds for these lists. i've added the feed from general to the
>>>> front page and think it'd be cool to add feeds to the pages in projects
>>>> as well. since the focus of podlings should be recruiting developers
>>>> (not users) i'm thinking of adding feeds to the dev lists.
>>>>
>>>> opinions?
>>>>
>>>> volunteers?
>>> Just be aware that the feeds produced are rarely well formed XML, mostly
>>> due to encoding issues.  For example: http://tinyurl.com/h5f7t
>>>
>>> I tried to submit a patch based on my limited understanding of the code,
>>> and was told that my patch wasn't acceptable
> 
> To be clear, AFAIK, there was never a patch for mod_mbox -- it was a
> Ruby file that only solved part of the problem. Again, AFAIK, no one
> ever wrote a patch in C for mod_mbox to attempt to resolve this issue.

I offered.  The response was, and I quote, "Erm, no".

>>> and that XML parsers that
>>> require well-formedness were broken anyway -- despite that being
>>> explicitly what the spec requires.
> 
> It's unfortunate that the discussion degraded into that.
> 
>>> I'd be willing to try again, but only if there was active interest in
>>> actually fixing the problem.
> 
> Yes, there is active interest in making mod_mbox better.

This thread was in October, and since then the feed has not improved.

>> IMO we should fix the feed but i'm not involved with mod_mbox (or httpd).
>> anyone who is want to jump in here?
> 
> The primary bug is lack of encoding support.  mod_mbox just doesn't even
> try to do it.
> 
> Someone needs to write something that touches many parts of the code,
> using the apr_xlate API to convert the content to utf-8.  (This would
> also help it validate as HTML).  Once that is done, we do need to worry
> about out of range characters, some of which would be removed, others
> possibly HTML encoded.

Inside the message, there may be a content-type header.  Inside this 
header, there may be a charset parameter.  This charset parameter may be 
quoted, or it may not.  It may be correct, or it may not.

It would be worthwhile to attempt to extract this, and to attempt to 
convert at least the body portion of the message to utf-8.
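
Extracting the parameter might look roughly like the sketch below.
The function name extract_charset and its return convention are
invented for this example and are not part of mod_mbox.  It accepts
both quoted and unquoted values and lower-cases the result; it is
deliberately naive in that a ';' inside some other quoted parameter
would confuse it, and it makes no attempt to decide whether the
declared charset is actually correct.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive test: does `s` begin with `word`? */
static int starts_with_ci(const char *s, const char *word)
{
    for (; *word; s++, word++) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*word))
            return 0;
    }
    return 1;
}

static const char *skip_ws(const char *p)
{
    while (*p == ' ' || *p == '\t')
        p++;
    return p;
}

/* Hypothetical helper: copy the charset parameter of a Content-Type
 * value into `out`, lower-cased.  Returns 1 if a non-empty charset was
 * found, 0 otherwise (in which case `out` is the empty string). */
int extract_charset(const char *content_type, char *out, size_t outlen)
{
    assert(content_type != NULL && out != NULL && outlen > 0);
    out[0] = '\0';

    const char *p = strchr(content_type, ';');
    while (p != NULL) {
        p = skip_ws(p + 1);
        if (starts_with_ci(p, "charset")) {
            p = skip_ws(p + 7);
            if (*p == '=') {
                p = skip_ws(p + 1);
                int quoted = (*p == '"');
                if (quoted)
                    p++;
                size_t n = 0;
                while (*p != '\0' && n + 1 < outlen) {
                    if (quoted ? (*p == '"')
                               : (*p == ';' || isspace((unsigned char)*p)))
                        break;
                    out[n++] = (char)tolower((unsigned char)*p);
                    p++;
                }
                out[n] = '\0';
                return n > 0;
            }
        }
        p = strchr(p, ';');     /* move on to the next parameter */
    }
    return 0;
}
```

When this returns 0, or returns a name the conversion library does not
recognize, the caller still needs some default, which is where a
fall-back/safety net comes in.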

But in any case, the results after the conversion need to be sanitized.
The Ruby code that I offered to convert to C does exactly that - takes
something that is allegedly utf-8 and corrects a number of common
errors, and produces something that is guaranteed to be well formed.  Of
course, if you feed in absolute garbage, what you will get back is well
formed line noise.
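
For readers who want a feel for what such a sanitizer involves, the
following is a simplified sketch and not the code referred to in this
message.  The name clean_utf8 and the policy choices (replace
malformed bytes with U+FFFD, silently drop well-formed characters that
XML 1.0 does not allow) are assumptions made for this example; a real
implementation might instead reinterpret stray bytes as win-1252.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* The "Char" production of XML 1.0: which code points may appear. */
static int xml_char_ok(unsigned long cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

/* Decode one UTF-8 sequence starting at s (NUL-terminated input).
 * Returns its length (1..4) and sets *cp, or returns 0 if the bytes
 * are malformed or an overlong encoding. */
static size_t decode_utf8(const unsigned char *s, unsigned long *cp)
{
    size_t n, i;
    unsigned long min;

    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        n = 2; min = 0x80;    *cp = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        n = 3; min = 0x800;   *cp = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        n = 4; min = 0x10000; *cp = s[0] & 0x07;
    } else {
        return 0;
    }
    for (i = 1; i < n; i++) {
        /* A NUL fails this check too, so we never read past the end. */
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return *cp >= min ? n : 0;
}

/* Hypothetical helper: return a newly malloc'ed copy of `in` that is
 * valid UTF-8 containing only characters XML 1.0 permits. */
char *clean_utf8(const char *in)
{
    assert(in != NULL);
    const unsigned char *s = (const unsigned char *)in;
    /* A bad byte becomes the 3-byte U+FFFD, hence the factor of 3. */
    char *out = malloc(strlen(in) * 3 + 1);
    char *p = out;
    if (out == NULL)
        return NULL;
    while (*s != '\0') {
        unsigned long cp;
        size_t n = decode_utf8(s, &cp);
        if (n > 0 && xml_char_ok(cp)) {
            memcpy(p, s, n);            /* good character: copy as-is */
            p += n;
            s += n;
        } else if (n > 0) {
            s += n;                     /* e.g. a control char: drop */
        } else {
            memcpy(p, "\xEF\xBF\xBD", 3);   /* malformed byte */
            p += 3;
            s++;
        }
    }
    *p = '\0';
    return out;
}
```

Note that this only guarantees legal characters; markup-significant
characters such as '&' and '<' still have to be escaped before the
text is placed inside an element.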

As promised, here is a C version that does approximately the same thing:

http://intertwingly.net/stories/2006/07/04/clean_utf8_for_xml.c

This may be useful in display_atom_entry, mbox_static_message, and
mbox_xml_message.  It is safer than using <![CDATA[ ]]> as email messages
(such as this one) may contain such strings.
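
The reason escaping is safer can be shown with a small sketch.  The
function xml_escape below is a made-up name for this example, not
something in mod_mbox.  Because every '>' is rewritten, a literal
"]]>" in the message body can no longer terminate anything early,
which is exactly the failure a CDATA section is exposed to.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return a newly malloc'ed copy of `in` with the
 * characters '&', '<' and '>' replaced by entity references, suitable
 * for element content.  Caller frees the result. */
char *xml_escape(const char *in)
{
    assert(in != NULL);
    /* Worst case is all '&', each becoming the 5 bytes of "&amp;". */
    char *out = malloc(strlen(in) * 5 + 1);
    char *p = out;
    if (out == NULL)
        return NULL;
    for (; *in != '\0'; in++) {
        const char *rep = NULL;
        switch (*in) {
        case '&': rep = "&amp;"; break;
        case '<': rep = "&lt;";  break;
        case '>': rep = "&gt;";  break;
        default:  break;
        }
        if (rep != NULL) {
            size_t n = strlen(rep);
            memcpy(p, rep, n);
            p += n;
        } else {
            *p++ = *in;
        }
    }
    *p = '\0';
    return out;
}
```

Attribute values would additionally need the quote character escaped;
this sketch only covers element content.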

Also note that if the content_type of the original MIME message contains 
the string "html", you might want to adjust the type attribute on the 
atom:content element accordingly.

But back to the original point: even if nobody puts in the effort to
correctly interpret that message based on the specified charset, the
addition of this code or something similar (1) is necessary anyway, (2)
will make the result no worse than it currently is and has been for
months, and (3) will make a marked improvement in that it will correct a
number of common errors.

Please feel free to treat the code mentioned above as being under the 
Apache Software License version 2.0.  If you don't like my indentation 
or bracing style, by all means, adjust it to your tastes.  Convert the 
malloc to use the appropriate apr call.  Or if you prefer, throw it all 
away, and start over.  I don't care, I just want the Atom feeds that
are produced to be clean and valid.

> For future discussion of this please use dev@httpd.

OK

> Thanks,
> 
> -Paul

- Sam Ruby