Posted to dev@spamassassin.apache.org by John Gardiner Myers <jg...@proofpoint.com> on 2005/08/20 01:58:56 UTC
Preliminary design proposal for charset normalization support in SpamAssassin
The following is a preliminary proposal for adding support for
normalizing message charsets into Perl's internal Unicode form. The primary
reason I want to do this work is to improve the ability of
SpamAssassin to discriminate between Japanese ham and Japanese spam.
SpamAssassin currently ignores charset information, effectively
assuming all mail is in iso-8859-1. This works for users whose ham is
encoded in iso-8859-1 and mostly works for users whose ham is encoded
in other single-byte charsets. For East Asian languages, this is
insufficient for doing text analysis.
Since a large number of SpamAssassin users are likely to be
uninterested in East Asian ham and thus unlikely to want to pay the
cost of charset normalization, the normalization support needs to be
optional, defaulting to off.
Some messages contain unlabeled charsets, others use MIME charset
labels. Some MIME charset labels are not useful
(e.g. "unknown-8bit"). To handle such nonlabeled data, it is
necessary to run a charset detector over the text in order to
determine what to convert it from. Encode::Guess effectively requires
the caller to specify the language of the text, so I consider it too
simplistic. Better would be Mozilla's universal charset detector,
which I would have to wrap up as a CPAN module.
It is common for Korean messages to have an incorrect MIME label of
"iso-8859-1", so it may be necessary to run a charset detector even
over MIME-labeled charsets.
After the charset has been determined, either from the MIME label or
the charset detector, the data needs to be converted from that charset
to Perl's internal utf8 form. Encode::decode() is the obvious choice
for this, though I can see reasons why an installation might want to
be able to replace the charset converters with some other
implementation.
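As a rough illustration of this step, here is a minimal sketch of the conversion path using core Encode. The function names (`to_internal`, `detect_charset`) are hypothetical, and `detect_charset` is only a stub standing in for a real detector such as Mozilla's:

```perl
use strict;
use warnings;
use Encode qw(decode FB_DEFAULT);

# Stand-in for a wrapper around Mozilla's universal charset detector;
# a real implementation would run the detector's state machines here.
sub detect_charset { return 'iso-8859-1' }

# Convert raw bytes to Perl's internal utf8 form, given an optional
# MIME charset label.  Labels like "unknown-8bit" are not useful, so
# fall back to detection for those and for unlabeled data.
sub to_internal {
    my ($bytes, $mime_charset) = @_;
    my $charset = $mime_charset;
    if (!defined $charset || $charset =~ /^unknown/i) {
        $charset = detect_charset($bytes);
    }
    # FB_DEFAULT substitutes U+FFFD for malformed input instead of dying.
    return decode($charset, $bytes, FB_DEFAULT);
}
```

An installation that wants to swap in a different converter would only need to replace the body of `to_internal`.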
The following functions, immediately after they call
Mail::SpamAssassin::Message::Node::decode, need to call a
function that does charset normalization.
* Mail::SpamAssassin::Message::get_rendered_body_text_array
* Mail::SpamAssassin::Message::get_visible_rendered_body_text_array
* Mail::SpamAssassin::Message::get_decoded_body_text_array
Furthermore:
* Mail::SpamAssassin::Message::Node::_decode_header
* Mail::SpamAssassin::Message::Node::__decode_header
also need to call a function to do charset normalization.
_decode_header for unlabeled charset data, __decode_header for
MIME encoded-words.
This new charset normalization function will take as arguments the
text and any MIME charset label. The function calls the charset
detector and converter as necessary and returns the normalized text in
Perl's internal form. The returned text will only have the utf8 flag
set if the input charset was not us-ascii or iso-8859-1.
This new charset normalization function should most likely use a
plugin callback to do all the work, though it only makes sense for one
loaded plugin to implement the callback. If no plugin implements the
callback, then it should simply return the input text, preserving the
current behavior.
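A minimal sketch of that entry point follows. The callback name ("charset_normalize") and the call_plugins convention are assumptions for illustration, not the actual SpamAssassin plugin API:

```perl
use strict;
use warnings;

# Sketch of the proposed normalization entry point.  It hands the text
# and any MIME charset label to a plugin callback; if no loaded plugin
# implements the callback, the input is returned unchanged, preserving
# current behavior.
sub normalize_charset {
    my ($self, $text, $mime_charset) = @_;
    my $normalized = $self->{main}->call_plugins("charset_normalize", {
        text    => $text,
        charset => $mime_charset,
    });
    return defined $normalized ? $normalized : $text;
}
```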
The other issue is that Mail::SpamAssassin::HTML uses two calls to
pack("C0A*", ...) in order to strip Perl's utf-8 flag from text going
into and out of HTML::Parser. When doing charset normalization, these
two pack calls need to be removed. In order for HTML::Parser to
correctly handle utf8, one needs minimum versions of Perl 5.8 and
HTML::Parser 3.39_90. HTML::Parser 3.43 might be a better minimum
version--I haven't reviewed the severity of the utf8 bug fixed in that
release. I see two possibilities:
1) Condition the two pack calls on version checks: (perl < 5.8 ||
HTML::Parser < 3.43)
2) Condition the two pack calls on charset normalization disabled.
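Option (1) might be sketched like this; under option (2) the condition would instead test whether charset normalization is enabled. The helper name is hypothetical:

```perl
use strict;
use warnings;

# Strip the utf8 flag only when the running perl/HTML::Parser
# combination is too old to handle utf8 correctly.
my $hp_version = eval { require HTML::Parser; $HTML::Parser::VERSION } || 0;
my $strip_utf8_flag = ($] < 5.008) || ($hp_version < 3.43);

sub maybe_strip_utf8 {
    my ($text) = @_;
    # pack("C0A*", ...) is the existing idiom for dropping the flag.
    return $strip_utf8_flag ? pack("C0A*", $text) : $text;
}
```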
Comments?
Re: Preliminary design proposal for charset normalization support in SpamAssassin
Posted by John Gardiner Myers <jg...@proofpoint.com>.
Daniel Quinlan wrote:
>What do you estimate the overhead would be?
>
>
Hard to estimate without some design choices, like when exactly to run
the charset detector. The charset detector runs about 10-20 state
machines over the text, in parallel. The conversion itself is another
pass and another copy of the text. When the text has characters outside
of iso-8859-1, one then has to pay the cost of Perl's Unicode regex
support for each of the rules. That means the cost will depend on the
percentage of non-iso-8859-1 messages in the message stream.
>What is the license of [Mozilla's universal charset detector]?
>
MPL.
>We can probably safely up the requirement for HTML::Parser in our next
>major revision. Conditioning is also okay.
>
>
Without the second pack call, any non-iso-8859-1 character entities will
cause the output to have the utf-8 bit set and thus engage Perl's
Unicode regex support.
>We should pay special attention to behaving as MUAs. I believe some
>MUAs will actually ignore the MIME character set and use the one
>specified in the message HTML (if it is HTML). We shouldn't necessarily
>assume all MUAs have been configured to use the local character set at
>all times.
>
>
Something to look into. One would have to pre-parse the HTML to see if
there is a charset label.
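A rough sketch of such a pre-parse, assuming a regex scan for illustration (a real implementation would presumably use HTML::Parser); the function name is hypothetical:

```perl
use strict;
use warnings;

# Pre-scan HTML for a charset declared in a
# <meta http-equiv="Content-Type" content="...; charset=..."> tag.
sub html_declared_charset {
    my ($html) = @_;
    if ($html =~ m{<meta[^>]+charset=["']?([-\w]+)}i) {
        return lc $1;
    }
    return undef;
}
```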
Re: Preliminary design proposal for charset normalization support in SpamAssassin
Posted by Faisal N Jawdat <fa...@faisal.com>.
Apparently I'm on crack (although I'm not the only one) about it
already being required. Was it always required and I just didn't
notice, or did I just miss the announcement that it was now
required? In any case, I agree that we should rev the requirement.
> We already require HTML::Parser. The only question is whether or
> not we
> should raise the minimum required version. I suggested that we
> raise it
> and I doubt anyone will have major objections since we haven't raised
> the minimum version for a while.
FWIW, CPAN doesn't even list 3.24 on the base page any more....
http://search.cpan.org/dist/HTML-Parser/
-faisal
Re: Preliminary design proposal for charset normalization support in SpamAssassin
Posted by Daniel Quinlan <qu...@pathname.com>.
Faisal N Jawdat <fa...@faisal.com> writes:
>> We can probably safely up the requirement for HTML::Parser in our next
>> major revision. Conditioning is also okay.
> What are reasons against doing this? I've sat through the "What is
> the regexp for matching this HTML pattern", "Don't use a regexp to
> parse HTML, use a real parser", "Here, HTML::Parser", "Wait, that's
> an optional install -- we can't rely on that" conversation at least 4
> times *in the last 10 days*.
I have no idea what your point is.
We already require HTML::Parser. The only question is whether or not we
should raise the minimum required version. I suggested that we raise it
and I doubt anyone will have major objections since we haven't raised
the minimum version for a while.
Daniel
--
Daniel Quinlan
http://www.pathname.com/~quinlan/
Re: Preliminary design proposal for charset normalization support in SpamAssassin
Posted by Faisal N Jawdat <fa...@faisal.com>.
> We can probably safely up the requirement for HTML::Parser in our next
> major revision. Conditioning is also okay.
What are reasons against doing this? I've sat through the "What is
the regexp for matching this HTML pattern", "Don't use a regexp to
parse HTML, use a real parser", "Here, HTML::Parser", "Wait, that's
an optional install -- we can't rely on that" conversation at least 4
times *in the last 10 days*.
-faisal
Re: Preliminary design proposal for charset normalization support in SpamAssassin
Posted by Daniel Quinlan <qu...@pathname.com>.
John Gardiner Myers <jg...@proofpoint.com> writes:
> the normalization support needs to be optional, defaulting to off.
What do you estimate the overhead would be?
> Better would be Mozilla's universal charset detector, which I would
> have to wrap up as a cpan module.
What is the license of it? We try to avoid requiring additional
external CPAN modules. We might want to ship with it... if the license
(and ASF policy) allows.
> The other issue is that Mail::SpamAssassin::HTML uses two calls to
> pack("C0A*", ...) in order to strip Perl's utf-8 flag from text going
> into and out of HTML::Parser. When doing charset normalization, these
> two pack calls need to be removed. In order for HTML::Parser to
> correctly handle utf8, one needs minimum versions of Perl 5.8 and
> HTML::Parser 3.39_90. HTML::Parser 3.43 might be a better minimum
> version--I haven't reviewed the severity of the utf8 bug fixed in that
> release. I see two possibilities:
>
> 1) Condition the two pack calls on version checks: (perl < 5.8 ||
> HTML::Parser < 3.43)
>
> 2) Condition the two pack calls on charset normalization disabled.
We can probably safely up the requirement for HTML::Parser in our next
major revision. Conditioning is also okay.
We should pay special attention to behaving as MUAs. I believe some
MUAs will actually ignore the MIME character set and use the one
specified in the message HTML (if it is HTML). We shouldn't necessarily
assume all MUAs have been configured to use the local character set at
all times.
Daniel
--
Daniel Quinlan
http://www.pathname.com/~quinlan/
Re: Preliminary design proposal for charset normalization support in SpamAssassin
Posted by John Gardiner Myers <jg...@proofpoint.com>.
Loren Wilton wrote:
>This sounds like it might be possible to decode the entire body text at
>least three times per message. [...] A common function might be able to
>decode once and cache the decoded text for the next call.
>
>
I did plan on having the newly created normalization function cache the
result just like Mail::SpamAssassin::Message::Node::decode does.
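The compute-once-and-cache-on-the-node pattern might look like this. All names here are stand-ins (a toy transformation in place of the real normalization), just to show the shape of the caching:

```perl
use strict;
use warnings;

package DemoNode;
# Stand-ins: decode() for Node::decode, normalize() for the real
# charset normalization; the "normalized" slot name is an assumption.
sub new       { my ($class, $raw) = @_; bless { raw => $raw }, $class }
sub decode    { $_[0]->{raw} }
sub normalize { uc $_[1] }      # trivial placeholder transformation

sub normalized_text {
    my ($self) = @_;
    # Compute once, cache on the node, reuse the cached copy after that.
    $self->{normalized} = $self->normalize($self->decode())
        unless exists $self->{normalized};
    return $self->{normalized};
}

package main;
```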
>
>
>>* Mail::SpamAssassin::Message::Node::_decode_header
>>* Mail::SpamAssassin::Message::Node::__decode_header
>>
>>
>This again sounds like it might be possible to decode many times per
>message?
>
>
No, these are only called upon construction.
Re: Preliminary design proposal for charset normalization support in SpamAssassin
Posted by Loren Wilton <lw...@earthlink.net>.
> The following functions, immediately after they call
> Mail::SpamAssassin::Message::Node::decode, need to call a
> function that does charset normalization.
>
> * Mail::SpamAssassin::Message::get_rendered_body_text_array
> * Mail::SpamAssassin::Message::get_visible_rendered_body_text_array
> * Mail::SpamAssassin::Message::get_decoded_body_text_array
This sounds like it might be possible to decode the entire body text at
least three times per message. This would likely be more overhead than
decoding it once, if that is possible. Perhaps there is (or could be) a
common function all of these would call to get text, or a common data
repository, that has been decoded. A common function might be able to
decode once and cache the decoded text for the next call.
> * Mail::SpamAssassin::Message::Node::_decode_header
> * Mail::SpamAssassin::Message::Node::__decode_header
>
> also need to call a function to do charset normalization.
> _decode_header for unlabeled charset data, __decode_header for
> MIME encoded-words.
This again sounds like it might be possible to decode many times per
message?
Loren
Re: Preliminary design proposal for charset normalization support in SpamAssassin
Posted by Matt Sergeant <ms...@messagelabs.com>.
On 23 Aug 2005, at 14:51, John Gardiner Myers wrote:
> Matt Sergeant wrote:
>
>> Wasn't there unicode normalisation in the original email parser that
>> I submitted to the project (that Theo turned into the current parser)
>> ?
>>
>> Certainly it would make sense to use that if you could. It works very
>> well on a very large set of test data.
>
> That code only deals with MIME-labeled charsets. It has no provision
> for charset detection.
Really? I must have written that later in my local version of the code.
I can probably provide some code for charset detection - it's fairly
simple once you have the heuristics figured out.
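A toy version of the kind of heuristic Matt describes, classifying raw bytes by characteristic byte patterns. The patterns and their ordering are illustrative assumptions, nothing like a production-quality detector:

```perl
use strict;
use warnings;

sub guess_charset {
    my ($bytes) = @_;
    # ISO-2022-JP announces itself with escape sequences.
    return 'iso-2022-jp'
        if $bytes =~ /\x1b\x24[\x40\x42]|\x1b\x28[\x42\x4a]/;
    # Pure ASCII needs no conversion.
    return 'us-ascii' if $bytes !~ /[\x80-\xff]/;
    # Well-formed multibyte UTF-8 sequences.
    return 'utf-8'
        if $bytes =~ /[\xc2-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}/;
    return undef;   # would fall through to a real detector here
}
```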
Matt.
______________________________________________________________________
This email has been scanned by the MessageLabs Email Security System.
For more information please visit http://www.messagelabs.com/email
______________________________________________________________________
Re: Preliminary design proposal for charset normalization support in SpamAssassin
Posted by John Gardiner Myers <jg...@proofpoint.com>.
Matt Sergeant wrote:
> Wasn't there unicode normalisation in the original email parser that I
> submitted to the project (that Theo turned into the current parser) ?
>
> Certainly it would make sense to use that if you could. It works very
> well on a very large set of test data.
That code only deals with MIME-labeled charsets. It has no provision
for charset detection.
The code puts charset normalization inside of
Mail::SpamAssassin::Message::Node::decode(). I don't think charset
normalization is appropriate for the decode call that is used in parsing
message/rfc822 objects.
Re: Preliminary design proposal for charset normalization support in SpamAssassin
Posted by Matt Sergeant <ms...@messagelabs.com>.
Wasn't there unicode normalisation in the original email parser that I
submitted to the project (that Theo turned into the current parser) ?
Certainly it would make sense to use that if you could. It works very
well on a very large set of test data.
Matt.
Re: Preliminary design proposal for charset normalization support in SpamAssassin
Posted by John Gardiner Myers <jg...@proofpoint.com>.
Daniel Quinlan wrote:
>Just to play devil's advocate, I have one other question: would it be
>cheaper and safer to simply run tests for certain languages using
>multiple character sets?
>
>
I'm interested in more than just tests. I want the rendered data so I
can do Bayes-like things with it.
I've seen Japanese spam with GB2312 encoded-words in the headers. So
for Japanese, you'd need a test for each of five character sets:
iso-2022-jp, euc, shift-jis, utf-8, and gb2312. Spammers still would
have over five other Chinese and Korean character sets to use in order
to hide Japanese spam from those tests.
iso-2022-jp could have obscuring escape sequences placed between any two
characters. Writing a test to match against encoded iso-2022-jp would
be sort of like trying to write a test against encoded
quoted-printable. Then you have potential problems with the test firing
incorrectly because it is missing important context (like which
character set has been selected by the last escape sequence).
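A minimal demonstration of the redundant-escape trick, assuming core Encode's iso-2022-jp support: two distinct byte streams that decode to the same text.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two ISO-2022-JP byte streams decoding to the same two characters;
# the second switches back to ASCII and into JIS again between them,
# the kind of obscuring escape sequence described above.
my $plain    = "\x1b\x24\x42\x24\x22\x24\x24\x1b\x28\x42";
my $obscured = "\x1b\x24\x42\x24\x22\x1b\x28\x42"
             . "\x1b\x24\x42\x24\x24\x1b\x28\x42";
my $decoded_plain    = decode('iso-2022-jp', $plain);
my $decoded_obscured = decode('iso-2022-jp', $obscured);
# Both decode identically, so no single byte-level pattern matches both.
```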
>Safer: what if you guess wrong? what if the character set is hard to
>determine correctly (intentionally mixed-up, binary inserted,
>half-and-half, jumbled character sets, etc.).
>
>
Then you have to update the code. This is no different than MIME
multiparts.
Re: Preliminary design proposal for charset normalization support in SpamAssassin
Posted by Daniel Quinlan <qu...@pathname.com>.
Just to play devil's advocate, I have one other question: would it be
cheaper and safer to simply run tests for certain languages using
multiple character sets?
Cheaper: is it really cheaper to convert?
Safer: what if you guess wrong? what if the character set is hard to
determine correctly (intentionally mixed-up, binary inserted,
half-and-half, jumbled character sets, etc.).
Daniel
--
Daniel Quinlan
http://www.pathname.com/~quinlan/