Posted to dev@spamassassin.apache.org by Motoharu Kubo <mk...@3ware.co.jp> on 2006/01/08 18:07:31 UTC

Charset normalization issue (report, patch, and request)

This is my first post to the list.  I would like to report my test
results for the charset normalization patch.  In addition, I would like
to make a request regarding tokenization (with an experimental patch)
and the Bayes mechanism, in order to improve Japanese support.

At first, let me introduce myself briefly.  I am a native Japanese
speaker living in Japan.  My company offers commercial support for
spam/virus filtering with SpamAssassin, amavisd-new, and Maia Mailguard.
I have been using SA for more than two years.  It works great, but there
are two important problems for Japanese handling.

(1) It is very hard to maintain rules for Japanese words because there
     are several charsets in use (iso-2022-jp, shift-jis, utf-8, euc-jp)
     and charset normalization is not built in yet.  So I have to write
     a hex pattern for each individual charset.

     In addition, pattern matches sometimes fail.  The pattern /$C$$/
     matches a certain word as expected, but it can also match inside a
     different word whose encoding contains /$$C$$A/, because the match
     is not aligned to character boundaries.

     I welcome the normalization patch, because I will be able to write
     rules in UTF-8 and many mismatches of this type will be resolved.
     (See the example rules after point (2) below.)

     Today I tested the patch and it works great.  It could normalize
     iso-2022-jp, shift-jis, and utf-8 text bodies as well as MIME
     (base64) encoded header text.  In addition, it could normalize an
     incorrectly MIME-encoded header (declared as shift-jis but actually
     containing iso-2022-jp text).  I rewrote my ruleset for Japanese
     words in UTF-8, and these rules matched as expected.  I think more
     testing by many people is still needed, but I would strongly
     request that this patch be included officially in the next release.
     Thanks to John!

(2) The Bayes database contains many meaningless tokens for the text
     body.  As a result I feel it is unstable, and new mail tends to be
     classified as spam.

     For example, the following line (in iso-2022-jp)

     {ESC}$BM5J!$J?M:J%;%U%lC5$7$N7hDjHG!*{ESC}(B

     is tokenized to

     BM5JJ $bm5j!$j bm5jj $BM5J!$J ...

     o As "{ESC}$B" is an leading escape sequence, this should be
       ignored.  The first meaningful token should begin with "M5".

     o Each Japanese character needs 2-bytes.  Thus odd-byte token is
       meaningless.  "BM5JJ" should be avoided.

     o "$bm5j!$j" (converted to lower case) corresponds to different
       characters.

     With Shift-JIS, each Japanese character begins with an 8-bit byte
     followed by a 7-bit or 8-bit byte.  Most of the information (the
     8-bit bytes) is therefore lost.
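
To illustrate the rule-maintenance problem from point (1): with the word
"Judo", whose byte sequences are listed further below, a body rule today
has to be written once per charset as raw bytes, whereas with the
normalization patch a single UTF-8 rule would cover all charsets (the
rule names here are made up):

   # without normalization: one hex-escaped rule per charset
   # (Shift-JIS shown; iso-2022-jp, euc-jp, and utf-8 variants also needed)
   body     JP_JUDO_SJIS  /\x8f\x5f\x93\xb9/
   describe JP_JUDO_SJIS  Body contains "Judo" (Shift-JIS bytes)
   score    JP_JUDO_SJIS  0.1

   # with the normalization patch the rendered body is UTF-8,
   # so one rule is enough
   body     JP_JUDO_UTF8  /\xe6\x9f\x94\xe9\x81\x93/
   describe JP_JUDO_UTF8  Body contains "Judo" (UTF-8)
   score    JP_JUDO_UTF8  0.1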

I think two more enhancements are necessary to improve Japanese support.

(1) "split word with space" (tokenization) feature.  There is no space
     between words in Japanese (and Chinese, Korean).  Human can
     understand easily but tokenization is necessary for computer
     processing.  There is a program called kakasi and Text:Kakasi
     (GPLed) which handles tokenization based on special dictionary.  I
     made quick hack to John's patch experimentally and tested.

     As Kakasi does not support UTF-8, we have to convert UTF-8 to
     EUC-JP, process with kakasi, and then convert to UTF-8 again.  It is
     ugly, but it works fine.  Most word is split correctly.  The
     mismatch mentioned above will not occur.

     As spam in Japanese is increasing, supporting this kind of native
     language support would be great.

Index: lib/Mail/SpamAssassin/Message/Node.pm
===================================================================
--- Node.pm     2006-01-08 22:31:30.497174000 +0900
+++ Node.pm.new 2006-01-08 22:33:34.000000000 +0900
@@ -363,7 +363,15 @@

    dbg("Converting...");

-  return $converter->decode($data, 0);
+use Text::Kakasi;
+
+  my $res = Encode::encode("euc-jp",$converter->decode($data, 0));
+  my $rc  = Text::Kakasi::getopt_argv('kakasi','-ieuc','-w');
+  my $str = Text::Kakasi::do_kakasi($res);
+
+#dbg( "Kakasi: $str");
+  return Encode::decode("euc-jp",$str);
+#  return $converter->decode($data, 0);
  }

  =item rendered()
===================================================================

(2) The raw text body is passed to the Bayes tokenizer.  This causes
     some difficulties.

     For example, the word "Judo" is represented by two Kanji characters.
     It is encoded as:

       {ESC}$B=@F;{ESC}(B             ISO-2022-JP
       0x8f 0x5f 0x93 0xb9            Shift-JIS
       0xe6 0x9f 0x94 0xe9 0x81 0x93  UTF-8
       0xbd 0xc0 0xc6 0xbb            EUC-JP

     Thus (if the tokens are not lost altogether) many records for the
     same word are registered, which lowers the efficacy.  (The short
     script at the end of this point reproduces these byte sequences.)

     For the ISO-2022-JP encoding there is another problem.  We use
     hiragana and katakana very often (about 30 to 70% of the characters
     used are hiragana or katakana), and these characters are mapped into
     the lower, 7-bit space.  The following pattern is an example that
     actually occurs:

       {ESC}$B$3$N$3$H$K$D$$$F$O!"{ESC}(B

     Every two bytes just after the starting escape sequence ({ESC}$B)
     correspond to one Japanese character ($3, $N, $3, $H, and so on).

     As mentioned above, the current Bayes tokenizer cannot handle our
     charsets well.  Only ASCII words such as URIs and some technical
     words (Linux, Windows, ...) are registered; other useful words are
     dropped.  Instead, meaningless tokens are registered, which is
     just noise.

     I think that if Bayes could accept a normalized and tokenized body
     (split using a dictionary) and could handle the 8-bit portion well,
     we could improve its effectiveness for Japanese.
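
     The byte sequences above can be checked with a short throw-away
     script (just a sketch using the Encode module; it prints the raw
     bytes for each charset):

       use Encode;

       my $judo = "\x{67d4}\x{9053}";   # the two Kanji for "Judo"
       for my $cs (qw(iso-2022-jp shiftjis utf-8 euc-jp)) {
           my @bytes = map { sprintf "0x%02x", ord }
                       split //, Encode::encode($cs, $judo);
           printf "%-12s %s\n", $cs, join(" ", @bytes);
       }

     Unless Bayes sees one canonical form (UTF-8 after normalization),
     each of these byte sequences produces its own unrelated tokens for
     the very same word.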

Any comments, suggestions, or other info are greatly appreciated.

-- 
Motoharu Kubo
mkubo@3ware.co.jp

Re: Charset normalization issue (report, patch, and request)

Posted by Motoharu Kubo <mk...@3ware.co.jp>.
Motoharu Kubo wrote:
> The results are almost equal, only slightly different for Japanese;
> MeCab is more sophisticated than Kakasi.  However, MeCab also splits
> English words, e.g. "EUC_JP" into "EUC _ JP" and
> "http://www.yahoo.com/" into "http :// www . yahoo . com /".  Because
> URLs and e-mail addresses are important signatures, this may be
> problematic.

This problem is solved by using the newest version and modifying the
dictionary.  MeCab now keeps single-byte sequences as is.

-- 
----------------------------------------------------------------------
Motoharu Kubo
mkubo@3ware.co.jp

Re: Charset normalization issue (report, patch, and request)

Posted by Motoharu Kubo <mk...@3ware.co.jp>.
> It seems a bit odd to convert UTF-8 into EUC and back like this.  The 
> cost of transcoding is admittedly small compared to the cost of using 
> Perl's UTF-8 regex support for the tests, but I would suggest you 
> evaluate tokenizers that can work directly in UTF-8.  I believe MeCab is 
> one such tokenizer.

I tried MeCab today and it works fine.  I changed the code from:

   use Text::Kakasi;
   my $res = Encode::encode("euc-jp",Encode::decode("utf8",$text));
   my $rc  = Text::Kakasi::getopt_argv('kakasi','-ieuc','-w');
   my $str = Text::Kakasi::do_kakasi($res);
   $utf8 = Encode::decode("euc-jp", $str);

to

   use MeCab;
   my @arg = ('dummy', '-Owakati');
   my $mecab = new MeCab::Tagger (\@arg);
   $utf8 = $mecab->parse ($text);

I compiled MeCab with the --with-charset=utf8 option, so no charset
conversion is necessary.  There is no marked difference in processing
time, though MeCab is slightly faster than Kakasi.

The results are almost equal, only slightly different for Japanese;
MeCab is more sophisticated than Kakasi.  However, MeCab also splits
English words, e.g. "EUC_JP" into "EUC _ JP" and
"http://www.yahoo.com/" into "http :// www . yahoo . com /".  Because
URLs and e-mail addresses are important signatures, this may be
problematic.

I will ask the developer whether we can avoid this splitting for URLs etc.
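
In the meantime, one possible workaround (just a rough sketch; the
regexes are simplistic placeholders, and it assumes MeCab leaves a plain
alphanumeric marker untouched) would be to shield URLs and mail
addresses from the tokenizer and restore them afterwards:

   use MeCab;

   my @arg   = ('dummy', '-Owakati');
   my $mecab = new MeCab::Tagger (\@arg);

   sub tokenize_keeping_uris {
       my ($text) = @_;
       my @saved;
       my $i = 0;
       # replace URLs / mail addresses with markers before parsing
       $text =~ s{(https?://\S+|[\w.+-]+\@[\w.-]+)}
                 {push @saved, $1; "URISAVED" . $i++ . "END"}ge;
       my $parsed = $mecab->parse($text);
       # put the originals back afterwards
       $parsed =~ s{URISAVED(\d+)END}{$saved[$1]}ge;
       return $parsed;
   }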

-- 
Motoharu Kubo
mkubo@3ware.co.jp

Re: Charset normalization issue (report, patch, and request)

Posted by John Myers <jg...@proofpoint.com>.
I must say I was quite pleasantly surprised to find my change tested so 
quickly during a weekend.

I don't use Bayes, so I won't be putting a lot of effort into Japanese 
support in Bayes.  I will review your proposals:

> (1) "split word with space" (tokenization) feature.  There is no space
>     between words in Japanese (and Chinese, Korean).  Human can
>     understand easily but tokenization is necessary for computer
>     processing.  There is a program called kakasi and Text:Kakasi
>     (GPLed) which handles tokenization based on special dictionary.  I
>     made quick hack to John's patch experimentally and tested.
>
>     As Kakasi does not support UTF-8, we have to convert UTF-8 to
>     EUC-JP, process with kakasi, and then convert to UTF-8 again.  It is
>     ugly, but it works fine.  Most word is split correctly.  The
>     mismatch mentioned above will not occur.

It seems a bit odd to convert UTF-8 into EUC and back like this.  The 
cost of transcoding is admittedly small compared to the cost of using 
Perl's UTF-8 regex support for the tests, but I would suggest you 
evaluate tokenizers that can work directly in UTF-8.  I believe MeCab is 
one such tokenizer.

Converting UTF-8 to EUC-JP and back is problematic when the source 
charset does not fit in EUC-JP.  Consider what would happen with Russian 
spam, for example.  It is probably not a good idea to tokenize if the 
message is not in CJK.
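
For that, something as simple as checking the decoded text for CJK
characters before invoking the word splitter would probably do; a
sketch using Perl's Unicode script properties:

   # only hand the text to a CJK word splitter if it contains CJK scripts
   sub looks_like_cjk {
       my ($text) = @_;   # already-decoded character string
       return $text =~ /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/;
   }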

The GPL license of Kakasi and MeCab might be problematic if you want 
tokenization support to be included in stock SpamAssassin.

I believe tokenization should be done in Bayes, not in Message::Node.  I 
believe tests should be run against the non-tokenized form.

>
> (2) The raw text body is passed to the Bayes tokenizer.  This causes
>     some difficulties.

My reading of the Bayes code suggests that the "visible rendered" form
of the body is what is passed to the Bayes tokenizer.  But then, I don't
use Bayes, so I haven't seen what really happens.