Posted to dev@spamassassin.apache.org by Motoharu Kubo <mk...@3ware.co.jp> on 2006/01/08 18:07:31 UTC
Charset normalization issue (report, patch, and request)
This is my first post to the list. I would like to report my test
results for the charset normalization patch. In addition, I would like
to request improvements to tokenization (with an experimental patch)
and to the Bayes mechanism in order to improve Japanese support.
First, let me introduce myself briefly. I am a native Japanese speaker
living in Japan. My company offers commercial support for spam/virus
filtering with SpamAssassin, amavisd-new, and Maia Mailguard. I have
been using SA for more than two years. It works great, but there are
two important problems for Japanese handling.
(1) It is very hard to maintain rules for Japanese words because there
are several charsets in use (iso-2022-jp, shift-jis, utf-8, euc-jp) and
charset normalization is not built in yet. So I have to write hex
patterns for each individual charset.
In addition, pattern matches sometimes fail. The pattern /$C$$/
matches a certain word as expected, but it can also match a different
word whose encoded form contains the sequence /$$C$$A/.
I welcome the normalization patch because I will be able to write
rules in UTF-8, and many mismatches of this type will be resolved.
Today I tested the patch and it works great. It could normalize
iso-2022-jp, shift-jis, and utf-8 text bodies as well as MIME (base64)
encoded header text. In addition, it could normalize an incorrectly
MIME-encoded header (declared as charset shift-jis but actually
containing iso-2022-jp text). I modified my ruleset for Japanese words
to use UTF-8, and these rules matched as expected. I think more testing
by many people will be necessary, but I would strongly request that
this patch be included officially in the next release. Thanks to John!
(2) The Bayes database contains many meaningless tokens for the text
body. As a result I feel it is unstable, and new mail tends to be
classified as spam.
For example, the line (in iso-2022-jp)
{ESC}$BM5J!$J?M:J%;%U%lC5$7$N7hDjHG!*{ESC}(B
is tokenized to
BM5JJ $bm5j!$j bm5jj $BM5J!$J ...
o As "{ESC}$B" is an leading escape sequence, this should be
ignored. The first meaningful token should begin with "M5".
o Each Japanese character needs 2-bytes. Thus odd-byte token is
meaningless. "BM5JJ" should be avoided.
o "$bm5j!$j" (converted to lower case) corresponds to different
characters.
With Shift-JIS, each Japanese character begins with an 8-bit byte,
followed by a 7-bit or 8-bit byte. Most of the information (the 8-bit
bytes) is lost.
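As a rough illustration of the breakdown above, here is a small Perl
sketch. It applies a deliberately simplified splitter (not the real
Bayes tokenizer) to the payload of the iso-2022-jp example line; none
of the fragments it produces lines up with the real two-byte
characters.

  use strict;
  use warnings;

  # Payload of the example line, i.e. the bytes between {ESC}$B and {ESC}(B.
  my $payload = 'M5J!$J?M:J%;%U%lC5$7$N7hDjHG!*';

  # Simplified splitter: break on runs of non-word ASCII characters.
  my @fragments = grep { length } split /\W+/, $payload;
  print "@fragments\n";    # M5J J M J U lC5 7 N7hDjHG

  # The real character boundaries are every two bytes:
  #   M5 J! $J ?M :J %; %U %l C5 $7 $N 7h Dj HG !*
  # so none of the fragments corresponds to an actual character, let
  # alone a word.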
I think two more enhancements are necessary to improve Japanese support.
(1) "split word with space" (tokenization) feature. There is no space
between words in Japanese (and Chinese, Korean). Human can
understand easily but tokenization is necessary for computer
processing. There is a program called kakasi and Text:Kakasi
(GPLed) which handles tokenization based on special dictionary. I
made quick hack to John's patch experimentally and tested.
As Kakasi does not support UTF-8, we have to convert UTF-8 to
EUC-JP, process with kakasi, and then convert to UTF-8 again. It is
ugly, but it works fine. Most word is split correctly. The
mismatch mentioned above will not occur.
As spam in Japanese is increasing, this kind of native language
support would be great.
Index: lib/Mail/SpamAssassin/Message/Node.pm
===================================================================
--- Node.pm 2006-01-08 22:31:30.497174000 +0900
+++ Node.pm.new 2006-01-08 22:33:34.000000000 +0900
@@ -363,7 +363,15 @@
dbg("Converting...");
- return $converter->decode($data, 0);
+use Text::Kakasi;
+
+ my $res = Encode::encode("euc-jp",$converter->decode($data, 0));
+ my $rc = Text::Kakasi::getopt_argv('kakasi','-ieuc','-w');
+ my $str = Text::Kakasi::do_kakasi($res);
+
+#dbg( "Kakasi: $str");
+ return Encode::decode("euc-jp",$str);
+# return $converter->decode($data, 0);
}
=item rendered()
===================================================================
(2) The raw text body is passed to the Bayes tokenizer. This causes
some difficulties.
For example, the word "Judo" is represented by two Kanji characters.
It is encoded as:
{ESC}$B=@F;{ESC}(B ISO-2022-JP
0x8f 0x5f 0x93 0xb9 Shift-JIS
0xe6 0x9f 0x94 0xe9 0x81 0x93 UTF-8
0xbd 0xc0 0xc6 0xbb EUC-JP
Thus (when the tokens are not lost entirely) many records for the
same word are registered, which lowers the efficacy. (A short
demonstration follows at the end of this point.)
For the ISO-2022-JP encoding there is another problem. We use hiragana
and katakana very often (about 30 to 70% of the characters used are
hiragana or katakana). These characters are mapped into the lower,
7-bit area of the code space. The following pattern is an example that
actually occurs:
{ESC}$B$3$N$3$H$K$D$$$F$O!"{ESC}(B
Every two bytes just after the starting escape sequence ({ESC}$B)
correspond to one Japanese character ($3, $N, $3, $H, and so on).
As mentioned above, the current Bayes tokenizer cannot handle our
charsets well. Only ASCII words such as URIs and some technical words
(such as Linux, Windows, ...) are registered; other useful words are
dropped. Instead, meaningless tokens are registered, which adds noise.
I think that if Bayes could accept a normalized and tokenized body
(based on a dictionary) and could handle the 8-bit portion well, we
could improve its effectiveness for Japanese. A rough sketch follows
below.
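To make this concrete, here is a rough Perl sketch. The first part
reproduces the "Judo" table above with the Encode module and shows why
the raw bytes turn one word into several unrelated tokens; the second
part is the flow I have in mind, with a hypothetical
$tokenize_japanese callback standing in for Kakasi or MeCab (the
helper and variable names are only for illustration).

  use strict;
  use warnings;
  use Encode;

  # Part 1: the same word "Judo" (U+67D4 U+9053) becomes a different
  # byte string, and therefore a different Bayes token, in every
  # charset.
  my $judo = "\x{67d4}\x{9053}";
  for my $charset (qw(iso-2022-jp shift-jis utf-8 euc-jp)) {
      my $bytes = Encode::encode($charset, $judo);
      printf "%-12s %s\n", $charset,
             join(' ', map { sprintf '0x%02x', ord } split //, $bytes);
  }

  # Part 2: proposed flow: normalize the raw body to characters, let a
  # dictionary-based splitter insert spaces, then treat each
  # whitespace-separated UTF-8 word as one token.
  sub japanese_tokens {
      my ($raw_body, $charset, $tokenize_japanese) = @_;
      my $text      = Encode::decode($charset, $raw_body);  # normalize
      my $segmented = $tokenize_japanese->($text);          # e.g. MeCab -Owakati
      return grep { length } split /\s+/, $segmented;
  }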
Any comments, suggestions, or other info are greatly appreciated.
--
Motoharu Kubo
mkubo@3ware.co.jp
Re: Charset normalization issue (report, patch, and request)
Posted by Motoharu Kubo <mk...@3ware.co.jp>.
Motoharu Kubo wrote:
> The results are almost equal but slightly different for Japanese.
> MeCab is more sophisticated than Kakasi. However, MeCab also splits
> English words: for example, "EUC_JP" becomes "EUC _ JP" and
> "http://www.yahoo.com/" becomes "http :// www . yahoo . com /".
> Because URLs and e-mail addresses are important signatures, this may
> be problematic.
This problem is solved by using the newest version and modifying the
dictionary. MeCab now keeps single-byte sequences as is.
--
----------------------------------------------------------------------
Motoharu Kubo
mkubo@3ware.co.jp
Re: Charset normalization issue (report, patch, and request)
Posted by Motoharu Kubo <mk...@3ware.co.jp>.
> It seems a bit odd to convert UTF-8 into EUC and back like this. The
> cost of transcoding is admittedly small compared to the cost of using
> Perl's UTF-8 regex support for the tests, but I would suggest you
> evaluate tokenizers that can work directly in UTF-8. I believe MeCab is
> one such tokenizer.
I tried MeCab today. It works fine. I changed the code from:
use Text::Kakasi;

# re-encode the UTF-8 text as EUC-JP, because Kakasi cannot read UTF-8
my $res = Encode::encode("euc-jp", Encode::decode("utf8", $text));
# -ieuc: input is EUC-JP, -w: insert spaces between words
my $rc  = Text::Kakasi::getopt_argv('kakasi', '-ieuc', '-w');
my $str = Text::Kakasi::do_kakasi($res);
# convert the segmented text back to UTF-8
$utf8 = Encode::decode("euc-jp", $str);
to
use MeCab;

# -Owakati: output the input text with spaces inserted between words
my @arg   = ('dummy', '-Owakati');
my $mecab = MeCab::Tagger->new(\@arg);
$utf8 = $mecab->parse($text);
I compiled MeCab with the --with-charset=utf8 option, so no charset
conversion is necessary. There is no big difference in processing
time, but MeCab is slightly faster than Kakasi.
The results are almost equal but slightly different for Japanese.
MeCab is more sophisticated than Kakasi. However, MeCab also splits
English words: for example, "EUC_JP" becomes "EUC _ JP" and
"http://www.yahoo.com/" becomes "http :// www . yahoo . com /".
Because URLs and e-mail addresses are important signatures, this may
be problematic.
I will ask the developer whether we can avoid splitting URLs and
addresses. In the meantime, a possible workaround is sketched below.
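Until MeCab itself can be told to leave them alone, one hypothetical
workaround (this helper is my own sketch, not part of MeCab) is to
pull URLs and e-mail addresses out before segmentation and put them
back afterwards:

  use strict;
  use warnings;
  use MeCab;

  # Hypothetical helper: protect URLs and e-mail addresses from word
  # splitting by swapping them for a placeholder, running MeCab, and
  # restoring them in order. Assumes MeCab keeps the all-ASCII
  # placeholder intact as a single token.
  sub tokenize_keeping_uris {
      my ($text) = @_;
      my @saved;

      # crude patterns, for illustration only
      $text =~ s{(https?://\S+|[\w.+-]+\@[\w.-]+)}{
          push @saved, $1;
          ' URIPLACEHOLDER '
      }ge;

      my @arg   = ('dummy', '-Owakati');
      my $mecab = MeCab::Tagger->new(\@arg);
      my $out   = $mecab->parse($text);

      # put the saved URLs/addresses back, in the order they were removed
      $out =~ s{URIPLACEHOLDER}{ shift @saved }ge;
      return $out;
  }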
--
Motoharu Kubo
mkubo@3ware.co.jp
Re: Charset normalization issue (report, patch, and request)
Posted by John Myers <jg...@proofpoint.com>.
I must say I was quite pleasantly surprised to find my change tested so
quickly during a weekend.
I don't use Bayes, so I won't be putting a lot of effort into Japanese
support in Bayes. I will review your proposals:
> (1) "split word with space" (tokenization) feature. There is no space
> between words in Japanese (and Chinese, Korean). Human can
> understand easily but tokenization is necessary for computer
> processing. There is a program called kakasi and Text:Kakasi
> (GPLed) which handles tokenization based on special dictionary. I
> made quick hack to John's patch experimentally and tested.
>
> As Kakasi does not support UTF-8, we have to convert UTF-8 to
> EUC-JP, process with kakasi, and then convert to UTF-8 again. It is
> ugly, but it works fine. Most word is split correctly. The
> mismatch mentioned above will not occur.
It seems a bit odd to convert UTF-8 into EUC and back like this. The
cost of transcoding is admittedly small compared to the cost of using
Perl's UTF-8 regex support for the tests, but I would suggest you
evaluate tokenizers that can work directly in UTF-8. I believe MeCab is
one such tokenizer.
Converting UTF-8 to EUC-JP and back is problematic when the source
charset does not fit in EUC-JP. Consider what would happen with Russian
spam, for example. It is probably not a good idea to tokenize if the
message is not in CJK.
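A minimal sketch of that guard, assuming the text has already been
decoded to Perl characters: check for Han, Hiragana, Katakana, or
Hangul characters before invoking a word splitter at all.

  # Returns true if the decoded text contains any CJK characters, so
  # the dictionary-based tokenizer can be skipped for Russian,
  # English, and other non-CJK mail.
  sub looks_like_cjk {
      my ($text) = @_;
      return $text =~ /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/;
  }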
The GPL license of Kakasi and MeCab might be problematic if you want
tokenization support to be included in stock SpamAssassin.
I believe tokenization should be done in Bayes, not in Message::Node. I
believe tests should be run against the non-tokenized form.
>
> (2) The raw text body is passed to the Bayes tokenizer. This causes
> some difficulties.
My reading of the Bayes code suggests that the "visible rendered" form
of the body is what is passed to the Bayes tokenizer. But then, I don't
use Bayes, so I haven't seen what really happens.