Posted to dev@spamassassin.apache.org by Daniel Quinlan <qu...@pathname.com> on 2005/08/20 03:13:05 UTC

Re: Preliminary design proposal for charset normalization support in SpamAssassin

John Gardiner Myers <jg...@proofpoint.com> writes:

> the normalization support needs to be optional, defaulting to off.

What do you estimate the overhead would be?

> Better would be Mozilla's universal charset detector, which I would
> have to wrap up as a cpan module.

What is the license of it?  We try to avoid requiring additional
external CPAN modules.  We might want to ship with it... if the license
(and ASF policy) allows.
 
> The other issue is that Mail::SpamAssassin::HTML uses two calls to
> pack("C0A*", ...) in order to strip Perl's utf-8 flag from text going
> into and out of HTML::Parser.  When doing charset normalization, these
> two pack calls need to be removed.  In order for HTML::Parser to
> correctly handle utf8, one needs minimum versions of Perl 5.8 and
> HTML::Parser 3.39_90.  HTML::Parser 3.43 might be a better minimum
> version--I haven't reviewed the severity of the utf8 bug fixed in that
> release.  I see two possibilities:
> 
> 1) Condition the two pack calls on version checks: (perl < 5.8 ||
>    HTML::Parser < 3.43)
> 
> 2) Condition the two pack calls on charset normalization disabled.

We can probably safely up the requirement for HTML::Parser in our next
major revision.  Conditioning is also okay.

We should pay special attention to behaving as MUAs.  I believe some
MUAs will actually ignore the MIME character set and use the one
specified in the message HTML (if it is HTML).  We shouldn't necessarily
assume all MUAs have been configured to use the local character set at
all times.

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Re: Preliminary design proposal for charset normalization support in SpamAssassin

Posted by John Gardiner Myers <jg...@proofpoint.com>.
Daniel Quinlan wrote:

>What do you estimate the overhead would be?

Hard to estimate without some design choices, like when exactly to run
the charset detector.  The charset detector runs about 10-20 state 
machines over the text, in parallel.  The conversion itself is another 
pass and another copy of the text.  When the text has characters outside 
of iso-8859-1, one then has to pay the cost of Perl's Unicode regex 
support for each of the rules.  That means the cost will depend on the 
percentage of non-iso-8859-1 messages in the message stream.
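
As a rough illustration of the cost model described above (this is not SpamAssassin code, which is Perl), the Python sketch below performs the conversion pass and then checks whether the result stays within iso-8859-1. The function name `normalize` and the choice to pass in an already-detected charset are assumptions made for the example; the detector itself is left out.

```python
from typing import Tuple


def normalize(raw: bytes, charset: str) -> Tuple[str, bool]:
    """Convert raw bytes to text and report whether any character
    falls outside iso-8859-1 (code points above 0xFF).

    `charset` is assumed to come from an earlier detection step,
    which is not shown here.
    """
    # The conversion: one more pass over the data and one more copy.
    # Undecodable bytes become U+FFFD, which also counts as non-Latin-1.
    text = raw.decode(charset, errors="replace")
    # Only messages for which this is True would pay the extra
    # Unicode regex cost discussed in the thread.
    beyond_latin1 = any(ord(ch) > 0xFF for ch in text)
    return text, beyond_latin1
```

Under this model, a message whose converted text fits in iso-8859-1 pays only for detection and conversion, while anything else also pays the per-rule Unicode matching cost.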

>What is the license of [Mozilla's universal charset detector]?

MPL.

>We can probably safely up the requirement for HTML::Parser in our next
>major revision.  Conditioning is also okay.

Without the second pack call, any non-iso-8859-1 character entities will
cause the output to have the utf-8 bit set and thus engage Perl's 
Unicode regex support.

>We should pay special attention to behaving as MUAs.  I believe some
>MUAs will actually ignore the MIME character set and use the one
>specified in the message HTML (if it is HTML).  We shouldn't necessarily
>assume all MUAs have been configured to use the local character set at
>all times.

Something to look into.  One would have to pre-parse the HTML to see if
there is a charset label.
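
A minimal sketch of such a pre-parse, written in Python rather than Perl and under stated assumptions: the names `MetaCharsetFinder` and `sniff_html_charset` and the 4096-byte scan limit are hypothetical, and the bytes are decoded as latin-1 purely so that ASCII markup can be scanned before the real charset is known. It would not work for encodings that are not ASCII-compatible, such as UTF-16.

```python
import re
from html.parser import HTMLParser
from typing import Optional

# Matches charset=... inside a Content-Type value such as
# "text/html; charset=Shift_JIS".
_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)


class MetaCharsetFinder(HTMLParser):
    """Records the first charset label found in a <meta> tag."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("charset"):
            # <meta charset="...">
            self.charset = attr["charset"].strip().lower()
        elif attr.get("http-equiv", "").lower() == "content-type":
            # <meta http-equiv="Content-Type" content="...; charset=...">
            match = _CHARSET_RE.search(attr.get("content", ""))
            if match:
                self.charset = match.group(1).lower()


def sniff_html_charset(raw: bytes, limit: int = 4096) -> Optional[str]:
    """Return the charset label declared in the HTML, or None."""
    finder = MetaCharsetFinder()
    # latin-1 maps every byte to a character, so decoding cannot fail.
    finder.feed(raw[:limit].decode("latin-1"))
    return finder.charset
```

Whether the label found this way should override the MIME charset is the open question raised above; the sketch only shows how the label could be extracted.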


Re: Preliminary design proposal for charset normalization support in SpamAssassin

Posted by Faisal N Jawdat <fa...@faisal.com>.
Apparently I'm on crack (although I'm not the only one) about it  
already being required.  Was it always required and I just didn't  
notice, or did I just miss the announcement that it was now  
required?  In any case, I agree that we should rev the requirement.

> We already require HTML::Parser.  The only question is whether or not we
> should raise the minimum required version.  I suggested that we raise it
> and I doubt anyone will have major objections since we haven't raised
> the minimum version for a while.

FWIW, CPAN doesn't even list 3.24 on the base page any more....

http://search.cpan.org/dist/HTML-Parser/

-faisal


Re: Preliminary design proposal for charset normalization support in SpamAssassin

Posted by Daniel Quinlan <qu...@pathname.com>.
Faisal N Jawdat <fa...@faisal.com> writes:

>> We can probably safely up the requirement for HTML::Parser in our next
>> major revision.  Conditioning is also okay.

> What are reasons against doing this?  I've sat through the "What is  
> the regexp for matching this HTML pattern", "Don't use a regexp to  
> parse HTML, use a real parser", "Here, HTML::Parser", "Wait, that's  
> an optional install -- we can't rely on that" conversation at least 4  
> times *in the last 10 days*.

I have no idea what your point is.

We already require HTML::Parser.  The only question is whether or not we
should raise the minimum required version.  I suggested that we raise it
and I doubt anyone will have major objections since we haven't raised
the minimum version for a while.

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Re: Preliminary design proposal for charset normalization support in SpamAssassin

Posted by Faisal N Jawdat <fa...@faisal.com>.
> We can probably safely up the requirement for HTML::Parser in our next
> major revision.  Conditioning is also okay.

What are reasons against doing this?  I've sat through the "What is  
the regexp for matching this HTML pattern", "Don't use a regexp to  
parse HTML, use a real parser", "Here, HTML::Parser", "Wait, that's  
an optional install -- we can't rely on that" conversation at least 4  
times *in the last 10 days*.

-faisal