Posted to dev@spamassassin.apache.org by John Gardiner Myers <jg...@proofpoint.com> on 2005/08/20 01:58:56 UTC

Preliminary design proposal for charset normalization support in SpamAssassin

The following is a preliminary proposal for adding support for
normalizing message charsets into Perl's internal Unicode form.  The
primary reason I want to do this work is to improve the ability of
SpamAssassin to discriminate between Japanese ham and Japanese spam.

SpamAssassin currently ignores charset information, effectively
assuming all mail is in iso-8859-1.  This works for users whose ham is
encoded in iso-8859-1 and mostly works for users whose ham is encoded
in other single-byte charsets.  For East Asian languages, this is
insufficient for doing text analysis.

Since a large number of SpamAssassin users are likely to be
uninterested in East Asian ham and thus unlikely to want to pay the
cost of charset normalization, the normalization support needs to be
optional, defaulting to off.
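
If this ends up as an ordinary configuration option, it might look
something like the following sketch (the option name is hypothetical):

    # proposed setting, off by default
    normalize_charset 1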

Some messages leave their charset unlabeled, while others use MIME
charset labels.  Some MIME charset labels are not useful
(e.g. "unknown-8bit").  To handle such unlabeled data, it is
necessary to run a charset detector over the text in order to
determine what to convert it from.  Encode::Guess effectively requires
the caller to specify the language of the text, so I consider it too
simplistic.  Better would be Mozilla's universal charset detector,
which I would have to wrap up as a cpan module.
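
To illustrate why Encode::Guess falls short: the caller has to supply
the candidate encodings up front, which amounts to already knowing the
language of the text.  A minimal sketch (assuming $octets holds the
raw bytes of a message part):

    use Encode::Guess;

    # The suspect list below presumes the text is Japanese.
    my $enc = guess_encoding($octets, qw/euc-jp shiftjis 7bit-jis/);
    ref($enc) or die "can't guess encoding: $enc";  # error string on failure
    my $text = $enc->decode($octets);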

It is common for Korean messages to have an incorrect MIME label of
"iso-8859-1", so it may be necessary to run a charset detector even
over MIME-labeled charsets.

After the charset has been determined, either from the MIME label or
the charset detector, the data needs to be converted from that charset
to Perl's internal utf8 form.  Encode::decode() is the obvious choice
for this, though I can see reasons why an installation might want to
be able to replace the charset converters with some other
implementation.
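
The conversion step itself would be roughly the following; decode()'s
default handling replaces malformed bytes with U+FFFD, which seems
reasonable for spam analysis.  A site wanting a different converter
would replace this one call:

    use Encode;

    # $charset comes from the MIME label or the detector.
    my $internal = Encode::decode($charset, $octets);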

The following functions, immediately after they call
Mail::SpamAssassin::Message::Node::decode, need to call a
function that does charset normalization:

* Mail::SpamAssassin::Message::get_rendered_body_text_array
* Mail::SpamAssassin::Message::get_visible_rendered_body_text_array
* Mail::SpamAssassin::Message::get_decoded_body_text_array

Furthermore:

* Mail::SpamAssassin::Message::Node::_decode_header
* Mail::SpamAssassin::Message::Node::__decode_header

also need to call a function to do charset normalization:
_decode_header for unlabeled charset data, __decode_header for MIME
encoded-words.

This new charset normalization function will take as arguments the
text and any MIME charset label.  The function calls the charset
detector and converter as necessary and returns the normalized text in
Perl's internal form.  The returned text will only have the utf8 flag
set if the input charset was not us-ascii or iso-8859-1. 

This new charset normalization function should most likely use a
plugin callback to do all the work, though it only makes sense for one
loaded plugin to implement the callback.  If no plugin implements the
callback, then it should simply return the input text, preserving the
current behavior.
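
A rough sketch of the entry point, using the existing call_plugins()
dispatch (the "normalize_charset" hook name is hypothetical; it does
not exist today):

    # Returns text in Perl's internal form.  $mime_charset may be undef.
    sub normalize_charset {
      my ($self, $text, $mime_charset) = @_;

      # Let a loaded plugin do detection and conversion; it only makes
      # sense for one loaded plugin to implement this hook.
      my $normalized = $self->{main}->call_plugins("normalize_charset",
                         { text => $text, charset => $mime_charset });

      # No plugin implements the hook: return the input unchanged,
      # preserving current behavior (the utf8 flag stays off).
      return defined $normalized ? $normalized : $text;
    }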

The other issue is that Mail::SpamAssassin::HTML uses two calls to
pack("C0A*", ...) in order to strip Perl's utf-8 flag from text going
into and out of HTML::Parser.  When doing charset normalization, these
two pack calls need to be removed.  In order for HTML::Parser to
correctly handle utf8, one needs minimum versions of Perl 5.8 and
HTML::Parser 3.39_90.  HTML::Parser 3.43 might be a better minimum
version--I haven't reviewed the severity of the utf8 bug fixed in that
release.  I see two possibilities:

1) Condition the two pack calls on version checks: (perl < 5.8 ||
   HTML::Parser < 3.43)

2) Condition the two pack calls on charset normalization disabled.
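
Option 1 as code would be roughly the following (a real check would
want a proper version comparison to cope with dev releases like
3.39_90):

    # Strip the utf8 flag only on versions that can't cope with it.
    if ($] < 5.008 || $HTML::Parser::VERSION < 3.43) {
      $text = pack("C0A*", $text);  # copy without the utf8 flag
    }

Option 2 would test the charset normalization setting instead of the
versions.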

Comments?

Re: Preliminary design proposal for charset normalization support in SpamAssassin

Posted by John Gardiner Myers <jg...@proofpoint.com>.
Daniel Quinlan wrote:

>What do you estimate the overhead would be?
>  
>
Hard to estimate without some design choices, like when exactly to run 
the charset detector.  The charset detector runs about 10-20 state 
machines over the text, in parallel.  The conversion itself is another 
pass and another copy of the text.  When the text has characters outside 
of iso-8859-1, one then has to pay the cost of Perl's Unicode regex 
support for each of the rules.  That means the cost will depend on the 
percentage of non-iso-8859-1 messages in the message stream.

>What is the license of [Mozilla's universal charset detector]?
>
MPL. 

>We can probably safely up the requirement for HTML::Parser in our next
>major revision.  Conditioning is also okay.
>  
>
Without the second pack call, any non-iso-8859-1 character entities will 
cause the output to have the utf-8 bit set and thus engage Perl's 
Unicode regex support.

>We should pay special attention to behaving as MUAs.  I believe some
>MUAs will actually ignore the MIME character set and use the one
>specified in the message HTML (if it is HTML).  We shouldn't necessarily
>assume all MUAs have been configured to use the local character set at
>all times.
>  
>
Something to look into.  One would have to pre-parse the HTML to see if 
there is a charset label.
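
Something like this untested sketch, run over the raw HTML before the
main parse:

    use HTML::Parser;

    # Look for <meta http-equiv="Content-Type"
    #               content="text/html; charset=...">
    my $html_charset;
    my $p = HTML::Parser->new(
      api_version => 3,
      start_h => [ sub {
        my ($tagname, $attr) = @_;
        return unless $tagname eq 'meta' && !defined $html_charset;
        if (lc($attr->{'http-equiv'} || '') eq 'content-type'
            && ($attr->{content} || '') =~ /charset\s*=\s*["']?([\w.-]+)/i) {
          $html_charset = $1;
        }
      }, 'tagname,attr' ],
    );
    $p->parse($html);
    $p->eof;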


Re: Preliminary design proposal for charset normalization support in SpamAssassin

Posted by Faisal N Jawdat <fa...@faisal.com>.
Apparently I'm on crack (although I'm not the only one) about it  
already being required.  Was it always required and I just didn't  
notice, or did I just miss the announcement that it was now  
required?  In any case, I agree that we should rev the requirement.

> We already require HTML::Parser.  The only question is whether or  
> not we
> should raise the minimum required version.  I suggested that we  
> raise it
> and I doubt anyone will have major objections since we haven't raised
> the minimum version for a while.

FWIW, CPAN doesn't even list 3.24 on the base page any more....

http://search.cpan.org/dist/HTML-Parser/

-faisal


Re: Preliminary design proposal for charset normalization support in SpamAssassin

Posted by Daniel Quinlan <qu...@pathname.com>.
Faisal N Jawdat <fa...@faisal.com> writes:

>> We can probably safely up the requirement for HTML::Parser in our next
>> major revision.  Conditioning is also okay.

> What are reasons against doing this?  I've sat through the "What is  
> the regexp for matching this HTML pattern", "Don't use a regexp to  
> parse HTML, use a real parser", "Here, HTML::Parser", "Wait, that's  
> an optional install -- we can't rely on that" conversation at least 4  
> times *in the last 10 days*.

I have no idea what your point is.

We already require HTML::Parser.  The only question is whether or not we
should raise the minimum required version.  I suggested that we raise it
and I doubt anyone will have major objections since we haven't raised
the minimum version for a while.

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Re: Preliminary design proposal for charset normalization support in SpamAssassin

Posted by Faisal N Jawdat <fa...@faisal.com>.
> We can probably safely up the requirement for HTML::Parser in our next
> major revision.  Conditioning is also okay.

What are reasons against doing this?  I've sat through the "What is  
the regexp for matching this HTML pattern", "Don't use a regexp to  
parse HTML, use a real parser", "Here, HTML::Parser", "Wait, that's  
an optional install -- we can't rely on that" conversation at least 4  
times *in the last 10 days*.

-faisal


Re: Preliminary design proposal for charset normalization support in SpamAssassin

Posted by Daniel Quinlan <qu...@pathname.com>.
John Gardiner Myers <jg...@proofpoint.com> writes:

> the normalization support needs to be optional, defaulting to off.

What do you estimate the overhead would be?

> Better would be Mozilla's universal charset detector, which I would
> have to wrap up as a cpan module.

What is the license of it?  We try to avoid requiring additional
external CPAN modules.  We might want to ship with it... if the license
(and ASF policy) allows.
 
> The other issue is that Mail::SpamAssassin::HTML uses two calls to
> pack("C0A*", ...) in order to strip Perl's utf-8 flag from text going
> into and out of HTML::Parser.  When doing charset normalization, these
> two pack calls need to be removed.  In order for HTML::Parser to
> correctly handle utf8, one needs minimum versions of Perl 5.8 and
> HTML::Parser 3.39_90.  HTML::Parser 3.43 might be a better minimum
> version--I haven't reviewed the severity of the utf8 bug fixed in that
> release.  I see two possibilities:
> 
> 1) Condition the two pack calls on version checks: (perl < 5.8 ||
>    HTML::Parser < 3.43)
> 
> 2) Condition the two pack calls on charset normalization disabled.

We can probably safely up the requirement for HTML::Parser in our next
major revision.  Conditioning is also okay.

We should pay special attention to behaving as MUAs.  I believe some
MUAs will actually ignore the MIME character set and use the one
specified in the message HTML (if it is HTML).  We shouldn't necessarily
assume all MUAs have been configured to use the local character set at
all times.

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Re: Preliminary design proposal for charset normalization support in SpamAssassin

Posted by John Gardiner Myers <jg...@proofpoint.com>.
Loren Wilton wrote:

>This sounds like it might be possible to decode the entire body text at
>least three times per message.  [...]  A common function might be able to
>decode once and cache the decoded text for the next call.
>  
>
I did plan on having the newly created normalization cache the result 
just like Mail::SpamAssassin::Message::Node::decode does.
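
Something along these lines, with hypothetical method and field names:

    # Decode and normalize once, then serve the cached copy.
    sub normalized_text {
      my ($self) = @_;
      unless (exists $self->{normalized_text}) {
        $self->{normalized_text} =
          $self->normalize_charset($self->decode(), $self->{charset});
      }
      return $self->{normalized_text};
    }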

>  
>
>>* Mail::SpamAssassin::Message::Node::_decode_header
>>* Mail::SpamAssassin::Message::Node::__decode_header
>>    
>>
>This again sounds like it might be possible to decode many times per
>message?
>  
>
No, these are only called upon construction.


Re: Preliminary design proposal for charset normalization support in SpamAssassin

Posted by Loren Wilton <lw...@earthlink.net>.
> The following functions, immediately after they call
> Mail::SpamAssassin::Message::Node::decode, need to call a
> function that does charset normalization.
>
> * Mail::SpamAssassin::Message::get_rendered_body_text_array
> * Mail::SpamAssassin::Message::get_visible_rendered_body_text_array
> * Mail::SpamAssassin::Message::get_decoded_body_text_array

This sounds like it might be possible to decode the entire body text at
least three times per message.  This would likely be more overhead than
decoding it once, if that is possible.  Perhaps there is (or could be) a
common function all of these would call to get text, or a common data
repository, that has been decoded.  A common function might be able to
decode once and cache the decoded text for the next call.

> * Mail::SpamAssassin::Message::Node::_decode_header
> * Mail::SpamAssassin::Message::Node::__decode_header
>
> also need to call a function to do charset normalization.
> _decode_header for unlabeled charset data, __decode_header for
> MIME encoded-words.

This again sounds like it might be possible to decode many times per
message?

        Loren


Re: Preliminary design proposal for charset normalization support in SpamAssassin

Posted by Matt Sergeant <ms...@messagelabs.com>.
On 23 Aug 2005, at 14:51, John Gardiner Myers wrote:

> Matt Sergeant wrote:
>
>> Wasn't there unicode normalisation in the original email parser that 
>> I submitted to the project (that Theo turned into the current parser) 
>> ?
>>
>> Certainly it would make sense to use that if you could. It works very 
>> well on a very large set of test data.
>
> That code only deals with MIME-labeled charsets.  It has no provision 
> for charset detection.

Really? I must have written that later in my local version of the code. 
I can probably provide some code for charset detection - it's fairly 
simple once you have the heuristics figured out.

Matt.


Re: Preliminary design proposal for charset normalization support in SpamAssassin

Posted by John Gardiner Myers <jg...@proofpoint.com>.
Matt Sergeant wrote:

> Wasn't there unicode normalisation in the original email parser that I 
> submitted to the project (that Theo turned into the current parser) ?
>
> Certainly it would make sense to use that if you could. It works very 
> well on a very large set of test data.

That code only deals with MIME-labeled charsets.  It has no provision 
for charset detection.

The code puts charset normalization inside of 
Mail::SpamAssassin::Message::Node::decode().  I don't think charset 
normalization is appropriate for the decode call that is used in parsing 
message/rfc822 objects.



Re: Preliminary design proposal for charset normalization support in SpamAssassin

Posted by Matt Sergeant <ms...@messagelabs.com>.
Wasn't there unicode normalisation in the original email parser that I 
submitted to the project (that Theo turned into the current parser) ?

Certainly it would make sense to use that if you could. It works very 
well on a very large set of test data.

Matt.


Re: Preliminary design proposal for charset normalization support in SpamAssassin

Posted by John Gardiner Myers <jg...@proofpoint.com>.
Daniel Quinlan wrote:

>Just to play devil's advocate, I have one other question: would it be
>cheaper and safer to simply run tests for certain languages using
>multiple character sets?
>  
>
I'm interested in more than just tests.  I want the rendered data so I 
can do Bayes-like things with it.

I've seen Japanese spam with GB2312 encoded-words in the headers.  So 
for Japanese, you'd need a test for each of five character sets: 
iso-2022-jp, euc, shift-jis, utf-8, and gb2312.  Spammers still would 
have over five other Chinese and Korean character sets to use in order 
to hide Japanese spam from those tests.

iso-2022-jp could have obscuring escape sequences placed between any two 
characters.  Writing a test to match against encoded iso-2022-jp would 
be sort of like trying to write a test against encoded
quoted-printable.  Then you have potential problems with the test firing 
incorrectly because it is missing important context (like which 
character set has been selected by the last escape sequence).
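
To make the obscuring concrete: encoding each character separately
forces a redundant escape pair between every two characters, yet both
strings decode to identical text.

    use Encode;

    my $word = "\x{30B9}\x{30D1}\x{30E0}";  # katakana "supamu"

    # Normal: one ESC $ B ... ESC ( B around the whole word.
    my $plain = Encode::encode("iso-2022-jp", $word);

    # Obscured: a full escape pair around every single character.
    my $obscured = join "",
      map { Encode::encode("iso-2022-jp", $_) } split //, $word;

    # A byte-level rule matching $plain misses $obscured, although:
    print Encode::decode("iso-2022-jp", $plain) eq
          Encode::decode("iso-2022-jp", $obscured)
          ? "same text\n" : "different\n";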

>Safer: what if you guess wrong?  what if the character set is hard to
>determine correctly (intentionally mixed-up, binary inserted,
>half-and-half, jumbled character sets, etc.).
>  
>
Then you have to update the code.  This is no different than MIME 
multiparts.


Re: Preliminary design proposal for charset normalization support in SpamAssassin

Posted by Daniel Quinlan <qu...@pathname.com>.
Just to play devil's advocate, I have one other question: would it be
cheaper and safer to simply run tests for certain languages using
multiple character sets?

Cheaper: is the cost really cheaper to convert?

Safer: what if you guess wrong?  what if the character set is hard to
determine correctly (intentionally mixed-up, binary inserted,
half-and-half, jumbled character sets, etc.).

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/