Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2006/01/09 21:57:02 UTC

Re: Charset normalization issue (report, patch, and request)


"John Myers" writes:
> I must say I was quite pleasantly surprised to find my change tested so 
> quickly during a weekend.
> 
> I don't use Bayes, so I won't be putting a lot of effort into Japanese 
> support in Bayes.  I will review your proposals:
> 
> > (1) "split word with space" (tokenization) feature.  There is no space
> >     between words in Japanese (or in Chinese and Korean).  Humans can
> >     understand it easily, but tokenization is necessary for computer
> >     processing.  There is a program called kakasi, and a Text::Kakasi
> >     module (GPLed), which handle tokenization based on a special
> >     dictionary.  I made a quick experimental hack to John's patch and
> >     tested it.
> >
> >     As Kakasi does not support UTF-8, we have to convert UTF-8 to
> >     EUC-JP, process with kakasi, and then convert to UTF-8 again.  It is
> >     ugly, but it works fine.  Most words are split correctly.  The
> >     mismatch mentioned above will not occur.
> 
> It seems a bit odd to convert UTF-8 into EUC and back like this.  The 
> cost of transcoding is admittedly small compared to the cost of using 
> Perl's UTF-8 regex support for the tests, but I would suggest you 
> evaluate tokenizers that can work directly in UTF-8.  I believe MeCab is 
> one such tokenizer.
> 
> Converting UTF-8 to EUC-JP and back is problematic when the source 
> charset does not fit in EUC-JP.  Consider what would happen with Russian 
> spam, for example.  It is probably not a good idea to tokenize if the 
> message is not in CJK.
> 
> The GPL license of Kakasi and MeCab might be problematic if you want 
> tokenization support to be included in stock SpamAssassin.

For what it's worth, Kakasi looks good for tokenizing Japanese text, and
it's well-established. Given that it's pretty widely packaged (e.g., in
Debian, 'libtext-kakasi-perl'), I think it's a reasonable optional
dependency for sites that expect to see a lot of traffic in Japanese
charsets.

However, I'd greatly prefer if there was some way we could detect when
Kakasi tokenization should be used, instead of doing it in all cases.   It
only deals with one language, and I can foresee a case where there are 10
different tokenizers for different Asian charsets.  

We could make it dependent on TextCat's language identification... if
language is "ja", then apply the Kakasi tokenizer, if available.

That would also help reduce the impact of transcoding EUC-JP <-> UTF-8
if it's still required.
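
A rough sketch of the kind of gating I have in mind (the helper name and
plumbing are illustrative only; the language guess would come from TextCat,
and Text::Kakasi is only loaded if it's installed):

  use strict;
  use warnings;
  use Encode;

  my $kakasi_available = eval { require Text::Kakasi; 1 };

  # Tokenize only when the language guess says Japanese; otherwise return
  # the text untouched.
  sub maybe_tokenize_ja {
    my ($utf8_text, $lang) = @_;
    return $utf8_text unless $kakasi_available && defined $lang && $lang eq 'ja';

    # Text::Kakasi works on EUC-JP, so round-trip through that charset.
    my $euc = Encode::encode('euc-jp', Encode::decode_utf8($utf8_text));
    Text::Kakasi::getopt_argv('kakasi', '-ieuc', '-w');   # -w = split into words
    my $split = Text::Kakasi::do_kakasi($euc);
    return Encode::encode_utf8(Encode::decode('euc-jp', $split));
  }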

> I believe tokenization should be done in Bayes, not in Message::Node.  I 
> believe tests should be run against the non-tokenized form.

+1 agreed.

> > (2) Raw text body is passed to Bayes tokenizer.  This causes some
> >     difficulties.
> 
> My reading of the Bayes code suggests the "visible rendered" form of the 
> body is what is passed to the Bayes tokenizer.  But then I don't use 
> Bayes so haven't seen what really happens.

Yes, that is the intent (and what happens with english text, at least).

--j.


Re: Charset normalization issue (report, patch, and request)

Posted by Loren Wilton <lw...@earthlink.net>.
As an outsider, I find myself strongly agreeing with Motoharu-san that,
when dealing with at least the oriental multibyte languages, tokenization
belongs early in the stream, before both bayes and rules.

Of course this is an overhead penalty that should not be incurred on mail
that isn't likely to be encoded in this manner.  So this should be something
that only happens in the appropriate circumstances.  Whether that is a user
option in the config, or something that can be determined on the fly from
the charset declarations, I do not know.

I would hope that the check to determine whether splitting is needed is
something that will be done relatively infrequently, say no more than once
per body section in the mail or so, and not on a per-token basis.  Given
that that is the case, I think that splitting at the front would be
appropriate for common code in all distributions.

If the check must be repeated with great frequency, or the check is
inherently painful, then perhaps there should be two versions of the
functions making these splitting decisions, and which to use would be
conditioned by a user option.

        Loren


Re: Charset normalization issue (report, patch, and request)

Posted by Motoharu Kubo <mk...@3ware.co.jp>.
I measured how the bayes database improved.  I chose 101 hams that got a
BAYES_99 score with the old bayes database (thus these mails were false
positives).  I tested them with the new bayes engine and got the following
results.  Test 1: run with the new database.  Test 2: sa-learned and tested
again with the new database.  Test 3: re-run with the old database.  The new
database has learned 6500 spams and 12200 hams; the old database has learned
14300 spams and 250000 hams.

             Test 1    Test 2    Test 3
    BAYES_00    12        92         3
    BAYES_05     3         2         0
    BAYES_20     7         3         3
    BAYES_40     5         3         4
    BAYES_50    61         1        75
    BAYES_60    10         0         3
    BAYES_80     2         0         2
    BAYES_95     1         0        11

John Myers wrote:
> My experience shows that speed only becomes an issue when one ends up
> using Perl's UTF-8 regex engine to evaluate rules. In the case of Bayes,
> I believe correctness is more important. I would have to see a
> significant measured decrease in speed before considering sacrificing
> correctness for speed.

I made a UTF-8 aware tokenize_line and measured the processing time of
tokenize().  The Unicode-aware version took about 240ms; the byte-oriented
version took only 9ms.

I do not have appropriate test data in which \xa0 has been inserted by
HTML::Parse or elsewhere.  My test data showed the same results with the two
tokenize_line functions.
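
Incidentally, this kind of comparison can be scripted with the standard
Benchmark module.  A rough, self-contained sketch of the approach (the
sample data and iteration count are made up, not my actual harness):

  use strict;
  use warnings;
  use Encode;
  use Benchmark qw(timethese);

  # A mixed ASCII/Japanese line, repeated to roughly body size.
  my $bytes = Encode::encode_utf8("SpamAssassin \x{30c6}\x{30b9}\x{30c8} line ") x 200;
  my $chars = Encode::decode_utf8($bytes);

  timethese(2_000, {
    'byte-oriented' => sub { (my $t = $bytes) =~ tr/-A-Za-z0-9,\@\*\!_'"\$.\200-\377 / /cs; },
    'char-oriented' => sub { (my $t = $chars) =~ tr/-A-Za-z0-9,\@\*\!_'"\$.\200-\377 / /cs; },
  });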

> The fact that the Bayes code confuses A0 bytes in UTF-8 encoded
> characters with the U+00A0 character is one example of an issue that
> would be solved were the "use bytes" pragma removed. To be correct, the
> Bayes database should be storing all tokens in UTF-8, so they match
> regardless of how they are encoded.

Yes, adding and removing the UTF-8 flag on strings is necessary when going
back and forth between UTF-8-aware and byte-oriented routines.

However, using utf8 mode we can take language-specific aspects into account.
I added code to remove single-character hiragana/katakana/symbol tokens.


> I'm not yet convinced that tokenization belongs inside
> get_rendered_body_text_array() and
> get_visible_rendered_body_text_array(). I suspect the content preview,
> which uses get_rendered_body_text_array(), would look strange were it to
> be tokenized. I am using get_visible_rendered_body_text_array() for
> something which I'm not yet convinced needs tokenization. I think this
> area needs some field experience.

I removed splitter() and tested with an e-mail that contains a word split by
a line break (the word contains "\n").  The body test could not find this
word.  I think language-specific tokenization is necessary not only for
bayes but also for other tests.

A friend of mine reminded me that there is a second "normalization" issue in
Japanese.  Japanese charsets contain two-byte versions of alphanumeric and
some symbol characters.  We call these "zenkaku" and the corresponding 7-bit
forms "hankaku".  Of course the zenkaku version of a word doesn't match the
hankaku version.

The following code is a part of "zenkaku-to-hankaku normalization".

  $text =~ tr/\x{ff10}-\x{ff19}/0-9/;   # fullwidth digits
  $text =~ tr/\x{ff21}-\x{ff3a}/A-Z/;   # fullwidth uppercase letters
  $text =~ tr/\x{ff41}-\x{ff5a}/a-z/;   # fullwidth lowercase letters
  $text =~ tr/\x{2018}/`/;              # left single quotation mark
  $text =~ tr/\x{2019}/'/;              # right single quotation mark
        ...........

I think this normalization should be done before header and body tests.
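
For illustration, the snippet above could be wrapped up roughly like this
(the helper name and the final full-range fold are mine, not part of my
patch; it expects a decoded, character-oriented string):

  sub zen2han {
    my ($text) = @_;
    $text =~ tr/\x{ff10}-\x{ff19}/0-9/;    # fullwidth digits
    $text =~ tr/\x{ff21}-\x{ff3a}/A-Z/;    # fullwidth uppercase letters
    $text =~ tr/\x{ff41}-\x{ff5a}/a-z/;    # fullwidth lowercase letters
    $text =~ tr/\x{ff01}-\x{ff5e}/!-~/;    # remaining fullwidth ASCII punctuation
    $text =~ tr/\x{3000}/ /;               # ideographic (fullwidth) space
    return $text;
  }

  # e.g. zen2han("ＶＩＡＧＲＡ　５０％ＯＦＦ") returns "VIAGRA 50%OFF"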


============== patch to make tokenize_line UTF-8 aware =============
--- Bayes.pm.bytes      2006-01-14 20:50:44.000000000 +0900
+++ Bayes.pm    2006-01-14 20:51:02.000000000 +0900
@@ -342,10 +342,13 @@

   my @rettokens = ();

+  no bytes;
+  utf8::decode($_);
+
   # include quotes, .'s and -'s for URIs, and [$,]'s for Nigerian-scam strings,
   # and ISO-8859-15 alphas.  Do not split on @'s; better results keeping it.
   # Some useful tokens: "$31,000,000" "www.clock-speed.net" "f*ck" "Hits!"
-  tr/-A-Za-z0-9,\@\*\!_'"\$.\200-\377 / /cs;
+  tr/#%&()+\/:;<=>?\[\\]^`{|}~/ /s;

   # DO split on "..." or "--" or "---"; common formatting error resulting in
   # hapaxes.  Keep the separator itself as a token, though, as long ones can
@@ -379,19 +388,24 @@

     # but extend the stop-list. These are squarely in the gray
     # area, and it just slows us down to record them.
-    next if $len < 3 ||
-       ($token =~ /^(?:a(?:nd|ny|ble|ll|re)|
-               m(?:uch|ost|ade|ore|ail|ake|ailing|any|ailto)|
-               t(?:his|he|ime|hrough|hat)|
-               w(?:hy|here|ork|orld|ith|ithout|eb)|
-               f(?:rom|or|ew)| e(?:ach|ven|mail)|
-               o(?:ne|ff|nly|wn|ut)| n(?:ow|ot|eed)|
-               s(?:uch|ame)| l(?:ook|ike|ong)|
-               y(?:ou|our|ou're)|
-               The|has|have|into|using|http|see|It's|it's|
-               number|just|both|come|years|right|know|already|
-               people|place|first|because|
-               And|give|year|information|can)$/x);
+    if ( $token =~ /^[\x00-\x7f]+$/ ) {
+      next if $len < 3 ||
+         ($token =~ /^(?:a(?:nd|ny|ble|ll|re)|
+                 m(?:uch|ost|ade|ore|ail|ake|ailing|any|ailto)|
+                 t(?:his|he|ime|hrough|hat)|
+                 w(?:hy|here|ork|orld|ith|ithout|eb)|
+                 f(?:rom|or|ew)| e(?:ach|ven|mail)|
+                 o(?:ne|ff|nly|wn|ut)| n(?:ow|ot|eed)|
+                 s(?:uch|ame)| l(?:ook|ike|ong)|
+                 y(?:ou|our|ou're)|
+                 The|has|have|into|using|http|see|It's|it's|
+                 number|just|both|come|years|right|know|already|
+                 people|place|first|because|
+                 And|give|year|information|can)$/x);
+    }
+    else {
+      next if $len < 2 && $token =~ /^[\p{InHiragana}\p{InKatakana}\x{3000}-\x{303f}]+$/;
+    }

     # are we in the body?  If so, apply some body-specific breakouts
     if ($region == 1 || $region == 2) {
@@ -456,9 +470,11 @@
       }
     }

+    utf8::encode($token);
     push (@rettokens, $tokprefix.$token);
   }

+  use bytes;
   return @rettokens;
 }




-- 
----------------------------------------------------------------------
Motoharu Kubo           Thirdware Co., Ltd. ((株)サードウェア)
mkubo@3ware.co.jp       3-39-8 Nishi-Narashino, Funabashi, Chiba 274-0815, Japan
                        URL: http://www.3ware.co.jp/
                        Phone: 047-496-3341  Fax: 047-496-3370
                        Mobile: 090-6171-5545 / 090-8513-0246
 * All mail from our company is checked by the Z-Linux mail filter *

Re: Charset normalization issue (report, patch, and request)

Posted by John Myers <jg...@proofpoint.com>.
Motoharu Kubo wrote:

>> The problem here is the "use bytes" pragma at the top of
>> Bayes.pm--you'll want to remove that. Removing it will have some
>> follow-on consequences--the "use bytes" pragma will probably also have
>> to be removed from BayesStore and the other Bayes-related modules. The
>> BayesStore subclasses probably will also have to be modified to become
>> UTF-8 aware, storing tokens in UTF-8 form.
>
> I did not change this because I think speed is another important factor for
> a mail filter.

My experience shows that speed only becomes an issue when one ends up
using Perl's UTF-8 regex engine to evaluate rules. In the case of Bayes,
I believe correctness is more important. I would have to see a
significant measured decrease in speed before considering sacrificing
correctness for speed.

The fact that the Bayes code confuses A0 bytes in UTF-8 encoded
characters with the U+00A0 character is one example of an issue that
would be solved were the "use bytes" pragma removed. To be correct, the
Bayes database should be storing all tokens in UTF-8, so they match
regardless of how they are encoded.
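
A minimal illustration of that confusion (a toy example of mine, not
SpamAssassin code): under "use bytes" the final byte of the hiragana "da"
(U+3060, UTF-8 e3 81 a0) looks like a U+00A0 non-breaking space and gets
clobbered by byte-oriented whitespace folding.

  use strict;
  use warnings;

  my $bytes = "\xe3\x81\xa0";              # UTF-8 encoding of U+3060 (hiragana "da")

  {
    use bytes;
    (my $mangled = $bytes) =~ s/\xa0/ /g;  # byte-oriented NBSP folding
    printf "bytes mode:     %s\n", unpack("H*", $mangled);   # "e38120" -- no longer valid UTF-8
  }

  {
    my $chars = $bytes;
    utf8::decode($chars);                  # now one character, U+3060
    (my $ok = $chars) =~ s/\xa0/ /g;       # matches only a real U+00A0 character
    printf "character mode: U+%04X\n", ord($ok);             # U+3060, the kana survives
  }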


I'm not yet convinced that tokenization belongs inside
get_rendered_body_text_array() and
get_visible_rendered_body_text_array(). I suspect the content preview,
which uses get_rendered_body_text_array(), would look strange were it to
be tokenized. I am using get_visible_rendered_body_text_array() for
something which I'm not yet convinced needs tokenization. I think this
area needs some field experience.


Re: Charset normalization issue (report, patch, and request)

Posted by Motoharu Kubo <mk...@3ware.co.jp>.
John

Thank you very much for your help.

Amazingly, the Bayes score for ham decreased drastically with my patch
yesterday.  I tested the same mail text with the old system and the new
system.  The old system returns BAYES_99, while the new system returns
BAYES_00!!

Although my patch is still incomplete and the bayes db on the new system is
a mixture of old-style and new-style tokens, it is an excellent result.

Today I changed:

o I made the splitter function separate out the Kakasi processing, and moved
  the routine from Message/Node.pm to Message.pm.  This makes it easier to
  replace Kakasi with another program.  The function simply returns if the
  text contains no UTF-8 data, so the loss of performance will be minimized
  for single-byte charsets.

  splitter is called from:
     get_rendered_body_text_array()
     get_visible_rendered_body_text_array()

o bayes tokenization for long tokens.  The original code cuts every two
  bytes from the top of the token.  As a multibyte UTF-8 character is at
  least 3 bytes, I modified it to cut on UTF-8 character boundaries instead
  (a rough sketch follows below).

  I am not sure whether this change is appropriate.
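
A rough sketch of what I mean by cutting on character boundaries (the helper
name is mine, just for illustration):

  use Encode;

  sub shorten_token {
    my ($token, $max_chars) = @_;
    my $chars = Encode::decode_utf8($token);    # bytes -> characters
    $chars = substr($chars, 0, $max_chars);     # cut on character boundaries
    return Encode::encode_utf8($chars);         # back to bytes for storage
  }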

I attached my newest patch.

> The patch you include below includes most of my change, but omits the
> following hunk. Perhaps the lack of that change is your problem?
> 
> @@ -385,7 +411,7 @@
> }
> else {
> $self->{rendered_type} = $self->{type};
> - $self->{rendered} = $text;
> + $self->{rendered} = $self->{visible_rendered} = $text;
> }
> }

My mistake.  I didn't check svn.  I included this hunk and deleted my
modification.  It works fine.

> The problem here is the "use bytes" pragma at the top of
> Bayes.pm--you'll want to remove that. Removing it will have some
> follow-on consequences--the "use bytes" pragma will probably also have
> to be removed from BayesStore and the other Bayes-related modules. The
> BayesStore subclasses probably will also have to be modified to become
> UTF-8 aware, storing tokens in UTF-8 form.

I did not change this because I think speed is another important factor for
a mail filter.

I inserted a check for whether the data contains UTF-8 characters, but it
may not be accurate.  s/([\x20-\x7f])\xa0+([\x20-\x7f])/$1$2/g would be more
accurate when using the "use bytes" pragma.

Motoharu Kubo
mkubo@3ware.co.jp

Re: Charset normalization issue (report, patch, and request)

Posted by John Myers <jg...@proofpoint.com>.
Motoharu Kubo wrote:

> However, there is another issue that I did not write about so far.  In
> Japanese and some other Asian languages a word can be split across lines
> without hyphenation.  Joining lines with a space causes problems, while not
> joining lines can leave an important keyword undetected because of the line
> break.  I am considering this issue right now.

Perhaps runs of whitespace between two CJK characters should be removed,
prior to tokenization.
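
Something along these lines, perhaps (a hedged sketch of mine; it assumes
the text has already been decoded to a character string):

  # Join CJK words that were only split by a line break; ASCII words keep
  # their separating spaces.
  sub join_cjk_lines {
    my ($text) = @_;
    $text =~ s/(?<=[\p{Han}\p{Hiragana}\p{Katakana}])\s+(?=[\p{Han}\p{Hiragana}\p{Katakana}])//g;
    return $text;
  }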

> The most time-consuming but accurate approach would be to tokenize in
> do_body_test if the language is "ja" and the content type is "text/plain".

I don't think you want to limit it to text/plain. Any sort of text/*
should be tokenized if it is in Japanese.

> I checked the code and found that bayes receives normalized header text
> and non-normalized body text.

This doesn't match what I see. Using your test case message, I show
get_visible_rendered_body_text_array() returning the normalized form.

The patch you include below includes most of my change, but omits the
following hunk. Perhaps the lack of that change is your problem?

@@ -385,7 +411,7 @@
}
else {
$self->{rendered_type} = $self->{type};
- $self->{rendered} = $text;
+ $self->{rendered} = $self->{visible_rendered} = $text;
}
}


> In addition, \xa0 is considered whitespace, but UTF-8 can contain this
> byte as the second or third byte of a character.  tokenize_line also cuts
> \200-\240.  I changed these problems as well, and bayes seems to receive
> normalized text.

The problem here is the "use bytes" pragma at the top of
Bayes.pm--you'll want to remove that. Removing it will have some
follow-on consequences--the "use bytes" pragma will probably also have
to be removed from BayesStore and the other Bayes-related modules. The
BayesStore subclasses probably will also have to be modified to become
UTF-8 aware, storing tokens in UTF-8 form.


Re: Charset normalization issue (report, patch, and request)

Posted by Motoharu Kubo <mk...@3ware.co.jp>.
"John Myers" writes:
>>I must say I was quite pleasantly surprised to find my change tested so 
>>quickly during a weekend.

That's because your patch and proposal are exactly what I wanted and have
been searching for.

"Justin Mason" writes:
> We could make it dependent on TextCat's language identification... if
> language is "ja", then apply the Kakasi tokenizer, if available.

This is an excellent idea.

I do not think that Kakasi is the best choice, as John and you said.
MeCab has a Perl interface and can handle UTF-8.  Its license is
tentatively LGPL.  The program does not contain a dictionary, so users have
to download one separately, but that is not a problem.

>>I believe tokenization should be done in Bayes, not in Message::Node.  I 
>>believe tests should be run against the non-tokenized form.
> 
> 
> +1 agreed.

As every byte of a multi-byte UTF-8 character has its high bit set, there is
no mismatch problem like the one I experienced with iso-2022-jp.  Thus I
agree with you as well.

However, there is another issue that I did not write about so far.  In
Japanese and some other Asian languages a word can be split across lines
without hyphenation.  Joining lines with a space causes problems, while not
joining lines can leave an important keyword undetected because of the line
break.  I am considering this issue right now.

The most time-consuming but accurate approach would be to tokenize in
do_body_test if the language is "ja" and the content type is "text/plain".

>>>(2) Raw text body is passed to Bayes tokenizer.  This causes some
>>>    difficulties.
>>
>>My reading of the Bayes code suggests the "visible rendered" form of the 
>>body is what is passed to the Bayes tokenizer.  But then I don't use 
>>Bayes so haven't seen what really happens.
> 
> 
> Yes, that is the intent (and what happens with english text, at least).

I checked the code and found that bayes receives normalized header text
and non-normalized body text.

If bayes should receive, and can handle, normalized body text, then
get_visible_rendered_body_text_array() should be modified.

In this function the content of a text/plain part is obtained by calling
$p->decode(), which returns non-normalized text.  Changing this to
"$p->rendered(); $text .= $p->{rendered};" seems to work fine.  Bayes then
receives normalized text.

In addition, \xa0 is considered whitespace, but UTF-8 can contain this
byte as the second or third byte of a character.  tokenize_line also cuts
\200-\240.  I changed these problems as well, and bayes seems to receive
normalized text.

I made a patch.  Text::Kakasi is still used in this patch, so this patch is
also experimental.  I will test it for a while.

Any help, suggestions, objections, and warnings are greatly appreciated.

======================= patch begins ===============================
diff -uNr SpamAssassin.orig/Bayes.pm SpamAssassin/Bayes.pm
--- SpamAssassin.orig/Bayes.pm	2005-08-12 09:38:47.000000000 +0900
+++ SpamAssassin/Bayes.pm	2006-01-10 22:40:14.031120448 +0900
@@ -345,7 +345,7 @@
   # include quotes, .'s and -'s for URIs, and [$,]'s for Nigerian-scam strings,
   # and ISO-8859-15 alphas.  Do not split on @'s; better results keeping it.
   # Some useful tokens: "$31,000,000" "www.clock-speed.net" "f*ck" "Hits!"
-  tr/-A-Za-z0-9,\@\*\!_'"\$.\241-\377 / /cs;
+  tr/-A-Za-z0-9,\@\*\!_'"\$.\200-\377 / /cs;

   # DO split on "..." or "--" or "---"; common formatting error resulting in
   # hapaxes.  Keep the separator itself as a token, though, as long ones can
diff -uNr SpamAssassin.orig/HTML.pm SpamAssassin/HTML.pm
--- SpamAssassin.orig/HTML.pm	2005-08-12 09:38:47.000000000 +0900
+++ SpamAssassin/HTML.pm	2006-01-10 22:39:01.662418537 +0900
@@ -742,7 +742,12 @@
     }
   }
   else {
-    $text =~ s/[ \t\n\r\f\x0b\xa0]+/ /g;
+    if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
+      $text =~ s/[ \t\n\r\f\x0b]+/ /g;
+    }
+    else {
+      $text =~ s/[ \t\n\r\f\x0b\xa0]+/ /g;
+    }
     # trim leading whitespace if previous element was whitespace
     if (@{ $self->{text} } &&
 	defined $self->{text_whitespace} &&
diff -uNr SpamAssassin.orig/Message/Node.pm SpamAssassin/Message/Node.pm
--- SpamAssassin.orig/Message/Node.pm	2005-08-12 09:38:46.000000000 +0900
+++ SpamAssassin/Message/Node.pm	2006-01-10 22:44:27.254093218 +0900
@@ -42,6 +42,8 @@
 use Mail::SpamAssassin::HTML;
 use Mail::SpamAssassin::Logger;

+our $normalize_supported = ( $] > 5.008004 && eval 'require Encode::Detect::Detector' && eval 'require Encode' );
+
 =item new()

 Generates an empty Node object and returns it.  Typically only called
@@ -342,6 +344,33 @@
   return 0;
 }

+sub _normalize {
+  my ($data, $charset) = @_;
+  return $data unless $normalize_supported;
+  my $detected = Encode::Detect::Detector::detect($data);
+  dbg("Detected charset ".($detected || 'none'));
+
+  my $converter;
+
+  if ($charset && ($detected || 'none') !~ /^(?:UTF|EUC|ISO-2022|Shift_JIS|Big5|GB)/i) {
+      dbg("Using labeled charset $charset");
+      $converter = Encode::find_encoding($charset);
+  }
+
+  $converter = Encode::find_encoding($detected) unless $converter || !defined($detected);
+
+  return $data unless $converter;
+
+  dbg("Converting...");
+
+  use Text::Kakasi;
+  my $res = Encode::encode("euc-jp",$converter->decode($data, 0));
+  my $rc  = Text::Kakasi::getopt_argv('kakasi','-ieuc','-w');
+  my $str = Text::Kakasi::do_kakasi($res);
+  my $utf8= Encode::decode("euc-jp",$str);
+  return $utf8;
+}
+
 =item rendered()

 render_text() takes the given text/* type MIME part, and attempts to
@@ -359,7 +388,7 @@
   return(undef,undef) unless ( $self->{'type'} =~ /^text\b/i );

   if (!exists $self->{rendered}) {
-    my $text = $self->decode();
+    my $text = _normalize($self->decode(), $self->{charset});
     my $raw = length($text);

     # render text/html always, or any other text|text/plain part as text/html
@@ -478,7 +507,7 @@

   if ( $cte eq 'B' ) {
     # base 64 encoded
-    return Mail::SpamAssassin::Util::base64_decode($data);
+    $data = Mail::SpamAssassin::Util::base64_decode($data);
   }
   elsif ( $cte eq 'Q' ) {
     # quoted printable
@@ -486,12 +515,13 @@
     # the RFC states that in the encoded text, "_" is equal to "=20"
     $data =~ s/_/=20/g;

-    return Mail::SpamAssassin::Util::qp_decode($data);
+    $data = Mail::SpamAssassin::Util::qp_decode($data);
   }
   else {
     # not possible since the input has already been limited to 'B' and 'Q'
     die "message: unknown encoding type '$cte' in RFC2047 header";
   }
+  return _normalize($data, $encoding);
 }

 # Decode base64 and quoted-printable in headers according to RFC2047.
@@ -505,15 +535,15 @@
   $header =~ s/\n[ \t]+/\n /g;
   $header =~ s/\r?\n//g;

-  return $header unless $header =~ /=\?/;
-
   # multiple encoded sections must ignore the interim whitespace.
   # to avoid possible FPs with (\s+(?==\?))?, look for the whole RE
   # separated by whitespace.
   1 while ($header =~ s/(=\?[\w_-]+\?[bqBQ]\?[^?]+\?=)\s+(=\?[\w_-]+\?[bqBQ]\?[^?]+\?=)/$1$2/g);

-  $header =~
-    s/=\?([\w_-]+)\?([bqBQ])\?([^?]+)\?=/__decode_header($1, uc($2), $3)/ge;
+  unless ($header =~
+	  s/=\?([\w_-]+)\?([bqBQ])\?([^?]+)\?=/__decode_header($1, uc($2), $3)/ge) {
+    $header = _normalize($header);
+  }

   return $header;
 }
diff -uNr SpamAssassin.orig/Message.pm SpamAssassin/Message.pm
--- SpamAssassin.orig/Message.pm	2005-09-14 11:07:31.000000000 +0900
+++ SpamAssassin/Message.pm	2006-01-10 22:42:22.388213543 +0900
@@ -760,6 +760,7 @@
   # 0: content-type, 1: boundary, 2: charset, 3: filename
   my @ct = Mail::SpamAssassin::Util::parse_content_type($part_msg->header('content-type'));
   $part_msg->{'type'} = $ct[0];
+  $part_msg->{'charset'} = $ct[2];

   # multipart sections are required to have a boundary set ...  If this
   # one doesn't, assume it's malformed and revert to text/plain
@@ -871,7 +872,12 @@

   # whitespace handling (warning: small changes have large effects!)
   $text =~ s/\n+\s*\n+/\f/gs;		# double newlines => form feed
-  $text =~ tr/ \t\n\r\x0b\xa0/ /s;	# whitespace => space
+  if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
+    $text =~ tr/ \t\n\r\x0b/ /s;	# whitespace => space
+  }
+  else {
+    $text =~ tr/ \t\n\r\x0b\xa0/ /s;	# whitespace => space
+  }
   $text =~ tr/\f/\n/;			# form feeds => newline

   # warn "message: $text";
@@ -925,13 +931,19 @@
       }
     }
     else {
-      $text .= $p->decode();
+      $p->rendered();
+      $text .= $p->{rendered};
     }
   }

   # whitespace handling (warning: small changes have large effects!)
   $text =~ s/\n+\s*\n+/\f/gs;		# double newlines => form feed
-  $text =~ tr/ \t\n\r\x0b\xa0/ /s;	# whitespace => space
+  if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
+    $text =~ tr/ \t\n\r\x0b/ /s;	# whitespace => space
+  }
+  else {
+    $text =~ tr/ \t\n\r\x0b\xa0/ /s;	# whitespace => space
+  }
   $text =~ tr/\f/\n/;			# form feeds => newline

   my @textary = split_into_array_of_short_lines ($text);
@@ -982,7 +994,13 @@

   # whitespace handling (warning: small changes have large effects!)
   $text =~ s/\n+\s*\n+/\f/gs;		# double newlines => form feed
-  $text =~ tr/ \t\n\r\x0b\xa0/ /s;	# whitespace => space
+  if ( $text =~ /[\xc0-\xff][\x80-\xbf][\x80-\xbf]/ ) {
+    $text =~ tr/ \t\n\r\x0b/ /s;	# whitespace => space
+  }
+  else {
+    $text =~ tr/ \t\n\r\x0b\xa0/ /s;	# whitespace => space
+  }
+  $text =~ tr/ \t\n\r\x0b/ /s;	# whitespace => space
   $text =~ tr/\f/\n/;			# form feeds => newline

   my @textary = split_into_array_of_short_lines ($text);
diff -uNr SpamAssassin.orig/Util/DependencyInfo.pm SpamAssassin/Util/DependencyInfo.pm
--- SpamAssassin.orig/Util/DependencyInfo.pm	2005-09-14 11:07:31.000000000 +0900
+++ SpamAssassin/Util/DependencyInfo.pm	2006-01-10 22:39:01.666417637 +0900
@@ -168,6 +168,12 @@
   desc => 'The "sa-update" script requires this module to access compressed
   update archive files.',
 },
+{
+  module => 'Encode::Detect',
+  version => '0.00',
+  desc => 'If this module is installed, SpamAssassin will detect charsets
+  and convert them into Unicode.',
+},
 );

 ###########################################################################



-- 
----------------------------------------------------------------------
Motoharu Kubo
mkubo@3ware.co.jp