Posted to dev@spamassassin.apache.org by Motoharu Kubo <mk...@3ware.co.jp> on 2006/01/15 03:00:11 UTC

I18n and l10n (Re: Charset normalization issue (report, patch, and request))

I changed the Subject.

Justin Mason wrote:
> I'm not sure I understand why.
> 
> Currently, Bayes is the only code that actually *uses* knowledge of how a
> string is tokenized into words; this isn't exposed to the rules at all.
> 
> If it should be, that's an entirely separate feature request. ;)

Thanks, Justin.  This is an important suggestion for me.

Yes, what I am describing is not only a "charset" normalization
issue.  It should be called an "I18n and l10n" issue, and it includes
the charset normalization issue.

This is why I propose and insist on my splitter function (the name of
this function may not be appropriate).

John's proposal and patch are a great first step toward I18n for me.
They could solve our daily headaches and frustration.

However, there are language-specific "normalization" issues, as I
explained in previous messages.  These might be called l10n issues,
in short.  Japanese permits a word to be split by a line feed.  There
are no spaces between words.  There are some "aliases" (a zenkaku
character and a hankaku character with the same glyph).  These
features are different from Western languages, and special handling
is necessary not only before Bayes tokenization but also before
body/header tests.
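To make this concrete, here is a minimal sketch of such handling (the
function name is illustrative only; this is not the patch itself).
It folds the zenkaku/hankaku "aliases" with Unicode NFKC and joins
Japanese words that were split by a line feed:

use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFKC);

# Hypothetical helper, not existing SA code.
sub normalize_ja {
    my ($text) = @_;   # decoded UTF-8 body text
    # NFKC compatibility normalization folds zenkaku (fullwidth)
    # forms to their hankaku/ASCII equivalents.
    $text = NFKC($text);
    # Japanese has no spaces between words, so a line feed between
    # two Japanese characters is not a word boundary; remove it.
    $text =~ s/([\p{Han}\p{Hiragana}\p{Katakana}])\n(?=[\p{Han}\p{Hiragana}\p{Katakana}])/$1/g;
    return $text;
}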

We Japanese localize some applications, such as browsers, word
processors, and spreadsheets, and those localizations are maintained
separately, because the Japanese versions are mainly used by
ourselves.  However, e-mail is not bound to a country.  We receive
English, Chinese, Korean (Hangul), and Japanese spam daily.  This is
why I think I should raise the l10n issue on SA's dev list.

I only know the Japanese-specific issues and am not sure what
specific issues exist in other languages.  So what I can implement is
the Japanese handling only.

Based on this consideration, the "splitter" function should be given
a more comprehensive, intuitive name, and it should receive TextCat's
result and the original charset information.  That way,
language-specific processing for languages other than Japanese can be
written as well.
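As a sketch of the shape I have in mind (the interface is
hypothetical, not a concrete API proposal):

# Dispatch on TextCat's language guess; unknown languages pass through.
my %lang_handlers = (
    ja => \&normalize_ja,   # the Japanese handling sketched above
    # other languages would register their own handlers here
);

sub normalize_text {
    my ($text, $lang, $charset) = @_;  # $lang from TextCat,
                                       # $charset from the MIME headers
    my $handler = $lang_handlers{$lang};
    return $handler ? $handler->($text) : $text;
}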

----------------------------------------------------------------------
Motoharu Kubo
mkubo@3ware.co.jp

Re: I18n and l10n

Posted by MATSUDA Yoh-ichi <yo...@flcl.org>.
Hello.

From: Motoharu Kubo <mk...@3ware.co.jp>
Subject: Re: I18n and l10n
Date: Tue, 17 Jan 2006 02:22:24 +0900

> MATSUDA Yoh-ichi wrote:
> > Is the above flow drawing correct or wrong?
> > And, John-san and Motoharu-san's patches are:
> > 
> >                                                            |    |
> >               +--------------------------------------------+    |
> >               |                           (NEW!)                V
> >               +-> converting html -> UTF-8 character -> [body]->+
> >                   to plain text        normalization            |
> >               +-------------------------------------------------+
> >               |      (NEW!)
> >               +-> tokenization -> [bayes]
> >                    by Mecab
> 
> My opinion is to tokenize just after charset normalization.
> 
> UTF-8 character -> tokenization -> [body]
> normalization
> 
> I have written the reasons why I insist on this flow several times.
> In short:
> 
> (1) to join words separated by a line break (e.g. "a\nb" becomes "ab"
>      if "ab" is the word)
> (2) to clarify word boundaries (e.g. "youwon" -> "you won")
> 
> > Many Japanese spams are written in the Shift-JIS codeset.
> > A Shift-JIS detecting rule is convenient.
> 
> My opinion is yes and no.
> 
> - There are many SJIS spams but also many iso-2022-jp encoded spams.

Yes.

> - Not all SJIS mails are spam.  A careless alert mail sent from a
>    Windows application is also SJIS encoded (without
>    base64/quoted-printable encoding).

Yes.

> - There might be some tendency or difference between SJIS spam and
>    iso-2022-jp spam, but it is not so significant, I think.

I think that is yes and no.

Not all SJIS mails are spam, but the 'spam probability' is high.

For example, consider the case where a received mail is characterized by:

(a) SJIS encoded
(b) came from Brazil, Mexico, Russia, Romania, ...
(c) dynamic address
(d) Razor2 registered
(e) BAYES_99

With (a) alone, we can't recognize whether the mail is spam or ham.
But with (a) and (b), the spam probability is higher than with (a) alone.
Likewise, (a), (b) and (c) together are higher than (a) and (b);
and (a), (b), (c) and (d) are higher still...

SA has meta rules for exactly this situation.

All rules are probabilistic, so SA calculates a combined probability.
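For illustration, such a meta rule might look like this (SJIS_C is my
Shift-JIS detecting rule from earlier in this thread; RDNS_DYNAMIC
and RAZOR2_CHECK are assumed to exist in your rule set; the score is
only an example):

meta     SJIS_DYN_RAZOR  SJIS_C && RDNS_DYNAMIC && RAZOR2_CHECK
describe SJIS_DYN_RAZOR  SJIS mail from a dynamic address, Razor2 listed
score    SJIS_DYN_RAZOR  2.5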

> - Writing rules with hex notation is troublesome, boring, and it
>    decreases productivity.  If we could normalize the charset, we
>    could write rules directly with a UTF-8 aware editor.

Yes.
Directly writing a REGEX rule with UTF-8 characters is very convenient.
But I think character normalization and tokenization before body
testing are troublesome, because they modify the message text, and a
REGEX rule writer cannot anticipate what the modified text will look
like.

Many rules are written against the pure plain message text.  If
character normalization and tokenization are inserted before body
testing, many body rules will become unavailable.

So,

> But if character normalization is inserted before body testing, my
> rules will become unavailable.
> > 
> > Do I have to re-write the above 2 rules from [body] to [rawbody]?
> 
> There are two possibilities.
> 
> (1) rewrite from BODY to RAWBODY as Matsuda-san says.
> (2) invent NBODY (or something else) apart from BODY.  NBODY would
>      contain a normalized and tokenized version of the body.  I once
>      thought of this idea but did not propose it because BODY has the
>      problems I mentioned above and the overhead of executing nbody
>      tests would increase.

I want (2), for reasons of rule compatibility.
--
MATSUDA Yoh-ichi(yoh)
mailto:yoh@flcl.org
http://www.flcl.org/~yoh/diary/

Re: I18n and l10n

Posted by Motoharu Kubo <mk...@3ware.co.jp>.
> There are two possibilities.
> 
> (1) rewrite from BODY to RAWBODY as Matsuda-san says.
> (2) invent NBODY (or something else) apart from BODY.  NBODY would
>      contain a normalized and tokenized version of the body.  I once
>      thought of this idea but did not propose it because BODY has the
>      problems I mentioned above and the overhead of executing nbody
>      tests would increase.

There is a third method.

rawbody  SJIS_BODY  eval:check_charset("Shift_JIS")
describe SJIS_BODY  Mail text is encoded with Shift JIS
score    SJIS_BODY  1.4

rawbody  JIS_BODY   eval:check_charset("ISO-2022-JP")
describe JIS_BODY   Mail text is encoded with JIS
score    JIS_BODY   -0.5

check_charset is a function that detects the charset of the rawbody
using Encode::Detect::Detector::detect.  I have not written this
function yet, though.
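A minimal sketch of how it could be written, as a plugin (untested;
the package name is arbitrary, and I assume the usual rawbody eval
signature, where the text arrives as an array reference of raw body
lines):

# loadplugin CharsetDetect /path/to/CharsetDetect.pm
package CharsetDetect;
use strict;
use warnings;
use Mail::SpamAssassin::Plugin;
use Encode::Detect::Detector;   # provides detect()
our @ISA = qw(Mail::SpamAssassin::Plugin);

sub new {
    my ($class, $mailsa) = @_;
    my $self = $class->SUPER::new($mailsa);
    bless $self, $class;
    $self->register_eval_rule('check_charset');
    return $self;
}

sub check_charset {
    my ($self, $pms, $rawbody, $wanted) = @_;
    # Guess the charset of the raw (still undecoded) body text.
    my $detected = Encode::Detect::Detector::detect(join("\n", @$rawbody));
    return (defined $detected && lc($detected) eq lc($wanted)) ? 1 : 0;
}

1;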

-- 
Motoharu Kubo
mkubo@3ware.co.jp

Re: I18n and l10n

Posted by Motoharu Kubo <mk...@3ware.co.jp>.
MATSUDA Yoh-ichi wrote:
> Is the above flow drawing correct or wrong?
> And, John-san and Motoharu-san's patches are:
> 
>                                                            |    |
>               +--------------------------------------------+    |
>               |                           (NEW!)                V
>               +-> converting html -> UTF-8 character -> [body]->+
>                   to plain text        normalization            |
>               +-------------------------------------------------+
>               |      (NEW!)
>               +-> tokenization -> [bayes]
>                    by Mecab

My opinion is to tokenize just after charset normalization.

UTF-8 character -> tokenization -> [body]
normalization

I have written the reasons why I insist on this flow several times.
In short:

(1) to join words separated by a line break (e.g. "a\nb" becomes "ab"
     if "ab" is the word)
(2) to clarify word boundaries (e.g. "youwon" -> "you won"; see the
     sketch after this list)
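For (2), a sketch of the boundary clarification using the Text::MeCab
binding (illustrative only; the function name is not from any patch,
and a UTF-8 MeCab dictionary is assumed):

use strict;
use warnings;
use Text::MeCab;   # Perl binding for the MeCab morphological analyzer

# Re-join the text with explicit spaces at the word boundaries that
# MeCab finds, so unsegmented runs become separate words for body
# tests and Bayes.
sub split_words_ja {
    my ($text) = @_;
    my $mecab = Text::MeCab->new;
    my @words;
    for (my $node = $mecab->parse($text); $node; $node = $node->next) {
        push @words, $node->surface
            if defined $node->surface && length $node->surface;
    }
    return join ' ', @words;
}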

> Many Japanese spams are written in the Shift-JIS codeset.
> A Shift-JIS detecting rule is convenient.

My opinion is yes and no.

- There are many SJIS spams but also many iso-2022-jp encoded spams.
- Not all SJIS mails are spam.  A careless alert mail sent from a
   Windows application is also SJIS encoded (without
   base64/quoted-printable encoding).
- There might be some tendency or difference between SJIS spam and
   iso-2022-jp spam, but it is not so significant, I think.
- Writing rules with hex notation is troublesome, boring, and it
   decreases productivity.  If we could normalize the charset, we
   could write rules directly with a UTF-8 aware editor.

> But if character normalization is inserted before body testing, my
> rules will become unavailable.
> 
> Do I have to re-write the above 2 rules from [body] to [rawbody]?

There are two possibilities.

(1) rewrite from BODY to RAWBODY as Matsuda-san says.
(2) invent NBODY (or something else) apart from BODY.  NBODY would
     contain a normalized and tokenized version of the body.  I once
     thought of this idea but did not propose it because BODY has the
     problems I mentioned above and the overhead of executing nbody
     tests would increase.
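If (2) were adopted, a rule could target the normalized text
explicitly; the syntax below is purely hypothetical:

nbody    NORM_YOUWON  /you won/
describe NORM_YOUWON  Matches against the normalized, tokenized body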

-- 
Motoharu Kubo
mkubo@3ware.co.jp

Re: I18n and l10n

Posted by MATSUDA Yoh-ichi <yo...@flcl.org>.
Hello.

From: Motoharu Kubo <mk...@3ware.co.jp>
Subject: I18n and l10n (Re: Charset normalization issue (report, patch, and request))
Date: Sun, 15 Jan 2006 11:00:11 +0900

> I changed the Subject.

Justin-san, John-san, and Motoharu-san, thanks a lot.

I think SA's message processing is:

raw mail -> [full] -> header part -> mime decoding -> [header] -+
              |(splitting)                                      |
              +-> body part -> mime decoding -> [rawbody] -+    |
                                                           |    |
              +--------------------------------------------+    |
              |                                                 V
              +-> converting html -> [body] ------------------->+
                  to plain text                                 |
              +-------------------------------------------------+
              |
              +-> tokenization -> [bayes]

# For proper viewing, please use a fixed font.

Is the above flow drawing correct or wrong?
And, John-san and Motoharu-san's patches are:

                                                           |    |
              +--------------------------------------------+    |
              |                           (NEW!)                V
              +-> converting html -> UTF-8 character -> [body]->+
                  to plain text        normalization            |
              +-------------------------------------------------+
              |      (NEW!)
              +-> tokenization -> [bayes]
                   by Mecab

The character normalization process is inserted before [body] testing.
So, with Motoharu-san's patch, one can write Japanese character
matching rules directly in user_prefs.
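For example, a rule like this hypothetical one (matching 未承諾広告,
the tag required by Japanese law on unsolicited advertising mail)
becomes possible without hex escapes:

body     JA_UNSOL_AD  /未承諾広告/
describe JA_UNSOL_AD  Japanese unsolicited-advertisement tag in body
score    JA_UNSOL_AD  2.0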

BTW, I wrote character codeset detecting rules:

body UTF8      /(([\xe0-\xef][\x80-\xbf][\x80-\xbf])(?!([\x81-\x9f\xe0-\xfc][\x40-\x7e\xc0-\xfc]|[\x81-\x9f\xf0-\xfc][\x40-\x7e\xc0-\xfc]|[\xc0-\xfc][\x40-\x7e\xc0-\xfc]))){5,}/

body SJIS_C /(([\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc])(?!([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf][\x80-\xbf]|[\xa1-\xfe][\xa1-\xfe]))){7,}/

Many Japanese spams are written in the Shift-JIS codeset.
A Shift-JIS detecting rule is convenient.

But if character normalization is inserted before body testing, my
rules will become unavailable.

Do I have to re-write the above 2 rules from [body] to [rawbody]?
--
MATSUDA Yoh-ichi(yoh)
mailto:yoh@flcl.org
http://www.flcl.org/~yoh/diary/