Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2006/01/17 21:34:54 UTC

Re: I18n and l10n

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


MATSUDA Yoh-ichi writes:
> > - Writing rules in hex notation is troublesome and boring, and it
> >    decreases productivity.  If we could normalize the charset, we could
> >    write rules directly with a UTF-8 aware editor.
> 
> Yes.
> Writing REGEX rules directly with UTF-8 characters is very convenient.
> But I think character normalization and tokenization before body
> testing is troublesome.
> Character normalization and tokenization modify the message text,
> so a REGEX rule writer cannot tell what the modified text will
> look like.
> 
> Many rules are written for the pure plain message text.
> If character normalization and tokenization are inserted before body
> testing, many body rules will stop working.
> 
> So,
> 
> > > But if character normalization is inserted before body testing,
> > > my rules will stop working.
> > > 
> > > Do I have to re-write the above 2 rules from [body] to [rawbody]?
> > 
> > There are two possibilities.
> > 
> > (1) rewrite from BODY to RAWBODY as Matsuda-san says.
> > (2) invent NBODY (or something else) apart from BODY.  NBODY contains
> >      normalized and tokenized version of body.  I once thought of this
> >      idea but did not propose because BODY has problems I mentioned
> >      above and overhead of executing nbody_test increases.
> 
> I want (2), for the reason of compatibility of rules.

+1, agreed.

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFDzVTuMJF5cimLx9ARAqxOAKCILBFwluZj3/yicF3aPBSTpy8vigCgkZ7C
kn0sKCBOmjDJRpSRh5LYVsw=
=eJbr
-----END PGP SIGNATURE-----

Re: I18n and l10n

Posted by Motoharu Kubo <mk...@3ware.co.jp>.

Justin Mason wrote:
>>>(1) rewrite from BODY to RAWBODY as Matsuda-san says.
>>>(2) invent NBODY (or something else) apart from BODY.  NBODY contains
>>>     normalized and tokenized version of body.  I once thought of this
>>>     idea but did not propose because BODY has problems I mentioned
>>>     above and overhead of executing nbody_test increases.
>>
>>I want (2), for the reason of compatibility of rules.
> 
> 
> +1, agreed.

I talked to Matsuda-san, and I now also agree with the NBODY idea,
because compatibility with existing rulesets is extremely important.

I wrote earlier that I did not want charset normalization and related
features to be optional, but I have changed my position.  It should be
a compile-time option or an SA option, because UTF-8 aware regexes will
cause a performance loss.  Not all SA users want this feature.  Instead,
I want NBODY and Bayes, which work on normalized and tokenized text, to
be fully UTF-8 aware.
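
To make the compatibility concern concrete, here is a minimal sketch in
Python.  It is not SpamAssassin code, and everything in it is an
assumption for illustration: the sample word, the EUC-JP source charset,
the helper names hex_rule() and normalize(), and NFKC as a stand-in for
whatever normalization would actually be used.  It only shows that a
byte-level rule written in hex notation matches the raw body but not
normalized text, while a rule typed directly in UTF-8 matches a separate
normalized view.

```python
import re
import unicodedata


def hex_rule(text, charset):
    """Build a byte-level pattern in \\xNN notation for `text` in `charset`.

    Hypothetical helper: mimics how a legacy rule is written by hand.
    """
    escaped = "".join(f"\\x{b:02x}" for b in text.encode(charset))
    return re.compile(escaped.encode("ascii"))


def normalize(raw, charset):
    """Decode to Unicode and apply NFKC (assumed stand-in for normalization)."""
    return unicodedata.normalize("NFKC", raw.decode(charset, errors="replace"))


WORD = "無料"  # hypothetical sample word, not taken from the thread
raw_body = f"これは{WORD}です".encode("euc_jp")

legacy_rule = hex_rule(WORD, "euc_jp")  # old-style rule against raw bytes
nbody_rule = re.compile(WORD)           # rule written directly in UTF-8

body = raw_body                         # existing view, left untouched
nbody = normalize(raw_body, "euc_jp")   # separate normalized view

# The legacy rule keeps working only if the original view is preserved.
legacy_hits_body = legacy_rule.search(body) is not None
legacy_hits_normalized = legacy_rule.search(nbody.encode("utf-8")) is not None
nbody_hits = nbody_rule.search(nbody) is not None

print(legacy_hits_body, legacy_hits_normalized, nbody_hits)
```

Keeping both views side by side is what option (2) amounts to: existing
rules keep seeing the text they were written for, and new rules can opt
in to the normalized text.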

-- 
Motoharu Kubo
mkubo@3ware.co.jp