Posted to users@spamassassin.apache.org by Yu Qian <ji...@gmail.com> on 2016/04/12 18:44:57 UTC

How does SpamAssassin process languages other than English

SpamAssassin uses Bayes as its classifier, which is typical and efficient
for English. But how does it process other languages, such as Asian
languages?

Can anyone explain that, or show the code where SpamAssassin does it?

Thanks, guys!

Re: How does SpamAssassin process languages other than English

Posted by Joe Quinn <jq...@pccc.com>.

On 4/12/2016 1:16 PM, Reindl Harald wrote:
>
>
> On 12.04.2016 at 18:44, Yu Qian wrote:
>> SpamAssassin uses Bayes as its classifier, which is typical and efficient
>> for English. But how does it process other languages, such as Asian
>> languages?
>>
>> Can anyone explain that, or show the code where SpamAssassin does it?
>
> Bayes is by definition language-agnostic.
>
> *You train* Bayes with samples of ham and spam (at least a few hundred
> of each); the tokenizer splits the messages into parts and builds a
> database of how often each word appears in spam and in ham (put simply).
While that's true, tokenizing languages that don't delimit words by 
whitespace is extremely difficult. For languages like Chinese, it can 
only be done by carrying around a language dictionary.
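
To illustrate the dictionary approach: one common technique is forward
maximum matching, where at each position you take the longest dictionary
entry that matches and fall back to a single character otherwise. The
Python sketch below shows only that idea. It is not something SpamAssassin
does, and the function name `segment` and the three-word dictionary are
made up for the example.

```python
def segment(text, dictionary, max_len=4):
    """Greedy forward maximum matching over a set of known words."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary; a real one would hold many thousands of words.
toy_dictionary = {"免费", "下载", "软件"}
print(segment("免费下载软件", toy_dictionary))
```

The quality of the result depends entirely on the dictionary, which is
the cost referred to above.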

Yu Qian, if you're up to reading code you may want to look at 
lib/Mail/SpamAssassin/Bayes.pm and 
lib/Mail/SpamAssassin/Plugin/Bayes.pm. I'm not familiar enough with the 
Bayes side of SA to say for sure, but you might be able to configure it 
or write a plugin that can do the tokenization you desire. You may also 
be able to reuse existing research from http://nlp.stanford.edu/ and such.

Re: How does SpamAssassin process languages other than English

Posted by Yu Qian <ji...@gmail.com>.

Cool, thanks guys. I think I have a good sense of how SpamAssassin works
now. We are working on a spam project, and it's great to have SpamAssassin.

---
Yu Qian
Ottawa Ontario
Phone: (514)-553-0198



On Wed, Apr 13, 2016 at 8:21 AM, RW <rw...@googlemail.com> wrote:

> On Tue, 12 Apr 2016 14:15:50 -0400
> Dianne Skoll wrote:
>
> > On Tue, 12 Apr 2016 13:41:51 -0400
> > Yu Qian <ji...@gmail.com> wrote:
> >
> > > Yup, that's right, it becomes difficult if we want to support
> > > multiple languages in one spam detection solution. It's true that
> > > there are some best practices for a single language, but I didn't
> > > see much support for multiple languages.
> >
> > The only practical approach is to normalize everything into Unicode
> > and tokenize Unicode characters.  (We actually use UTF-8 as the
> > on-disk representation.)
> >
> > We have a custom Bayes engine that treats any character in the CJK
> > Unified Ideographs range as a word.  This is not strictly correct
> > because there are two-character (and longer) CJK words, but it's close
> > enough,
>
> What happens in mainstream SpamAssassin is that if a word is over 15
> bytes long, then 3- and 4-byte UTF-8 characters are extracted as tokens
> in place of the original word. Everything can be normalized to UTF-8
> with "normalize_charset 1".
>
> This will likely work fairly well for CJK, but won't work well for any
> 3- or 4-byte UTF-8 alphabet that isn't composed of ideograms (unless
> it's only in spam). This includes most Asian and African languages.
>
> I think the best solution to this is simply to retain the original
> long word as a token, or to allow it as an option.
>
> Setting normalize_charset also helps with custom rules if you edit them
> as UTF-8, but it's important to remember that SA sees a multibyte
> character as a sequence of bytes rather than a single character. For
> example, you can't put a non-ASCII character between square brackets.
>

Re: How does SpamAssassin process languages other than English

Posted by RW <rw...@googlemail.com>.

On Tue, 12 Apr 2016 14:15:50 -0400
Dianne Skoll wrote:

> On Tue, 12 Apr 2016 13:41:51 -0400
> Yu Qian <ji...@gmail.com> wrote:
> 
> > Yup, that's right, it becomes difficult if we want to support
> > multiple languages in one spam detection solution. It's true that
> > there are some best practices for a single language, but I didn't
> > see much support for multiple languages.
> 
> The only practical approach is to normalize everything into Unicode
> and tokenize Unicode characters.  (We actually use UTF-8 as the
> on-disk representation.)
> 
> We have a custom Bayes engine that treats any character in the CJK
> Unified Ideographs range as a word.  This is not strictly correct
> because there are two-character (and longer) CJK words, but it's close
> enough,

What happens in mainstream SpamAssassin is that if a word is over 15
bytes long, then 3- and 4-byte UTF-8 characters are extracted as tokens
in place of the original word. Everything can be normalized to UTF-8 with
"normalize_charset 1".
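
As a rough illustration of the behaviour described here (a simplified
Python sketch, not the actual code in Bayes.pm, which does considerably
more), the following splits a UTF-8 body on whitespace and replaces any
word longer than 15 bytes with its 3- and 4-byte characters. The names
`tokenize`, `MULTIBYTE` and `MAX_WORD_BYTES` are invented for the example,
and the byte pattern is only an approximate test for UTF-8 sequences.

```python
import re

# Approximate match for one 3-byte or 4-byte UTF-8 sequence.
MULTIBYTE = re.compile(rb"[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf4][\x80-\xbf]{3}")
MAX_WORD_BYTES = 15  # threshold quoted in the message above


def tokenize(body: bytes):
    tokens = []
    for word in body.split():  # splits on ASCII whitespace
        if len(word) > MAX_WORD_BYTES:
            # Long word: emit its 3- and 4-byte characters instead of the word.
            tokens.extend(MULTIBYTE.findall(word))
        else:
            tokens.append(word)
    return tokens


print(tokenize("hello 免费下载软件大全".encode("utf-8")))
```

Note that in this sketch a long word made only of 1- or 2-byte characters
yields no tokens at all, which is the weakness discussed next.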

This will likely work fairly well for CJK, but won't work well for any
3- or 4-byte UTF-8 alphabet that isn't composed of ideograms (unless
it's only in spam). This includes most Asian and African languages.

I think the best solution to this is simply to retain the original
long word as a token, or to allow it as an option.

Setting normalize_charset also helps with custom rules if you edit them
as UTF-8, but it's important to remember that SA sees a multibyte
character as a sequence of bytes rather than a single character. For
example, you can't put a non-ASCII character between square brackets.
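
The square-bracket problem can be reproduced with any regex engine that
matches on bytes. The sketch below uses Python's `re` module on bytes as a
stand-in for that byte-oriented matching; it is an analogy, not
SpamAssassin rule syntax. A class meant as "é or è" becomes a class of
individual bytes and wrongly matches "ê", while an alternation of whole
byte sequences behaves as intended.

```python
import re

text = "\u00ea".encode("utf-8")  # "ê" is the bytes C3 AA

# Meant as "é or è" (C3 A9, C3 A8), but on bytes this is a class of the
# single bytes C3, A9 and A8, so it matches the first byte of "ê" too.
byte_class = re.compile("[\u00e9\u00e8]".encode("utf-8"))

# An alternation compares whole byte sequences instead.
alternation = re.compile("(?:\u00e9|\u00e8)".encode("utf-8"))

print(byte_class.search(text) is not None)   # True: a false match
print(alternation.search(text) is not None)  # False
```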

Re: How does SpamAssassin process languages other than English

Posted by Dianne Skoll <df...@roaringpenguin.com>.

On Tue, 12 Apr 2016 17:00:21 -0400
Yu Qian <ji...@gmail.com> wrote:

> That's nice to hear that SpamAssassin can look at word pairs,

Sorry, maybe I wasn't clear... I was talking about our own Bayes engine.
AFAIK, the SpamAssassin Bayes engine only looks at single words.

Regards,

Dianne.

Re: How does SpamAssassin process languages other than English

Posted by Yu Qian <ji...@gmail.com>.

That's nice to hear that SpamAssassin can look at word pairs. As I am new
to SpamAssassin, I am still trying to find out more interesting things
about it.

Regarding word pairs, can SpamAssassin detect a word that is split by
spaces, for example "Free" appearing in an email as "F R E E"?

---
Yu Qian
Ottawa Ontario
Phone: (514)-553-0198

On Tue, Apr 12, 2016 at 2:15 PM, Dianne Skoll <df...@roaringpenguin.com>
wrote:

> On Tue, 12 Apr 2016 13:41:51 -0400
> Yu Qian <ji...@gmail.com> wrote:
>
> > Yup, that's right, it becomes difficult if we want to support multiple
> > languages in one spam detection solution. It's true that there are
> > some best practices for a single language, but I didn't see much
> > support for multiple languages.
>
> The only practical approach is to normalize everything into Unicode and
> tokenize Unicode characters.  (We actually use UTF-8 as the on-disk
> representation.)
>
> We have a custom Bayes engine that treats any character in the CJK
> Unified Ideographs range as a word.  This is not strictly correct
> because there are two-character (and longer) CJK words, but it's close
> enough, especially because our Bayes engine also looks at word pairs.
>
> I think this is a Summer of Code project for SpamAssassin. :)
>
> Regards,
>
> Dianne.
>

Re: How does SpamAssassin process languages other than English

Posted by Dianne Skoll <df...@roaringpenguin.com>.

On Tue, 12 Apr 2016 13:41:51 -0400
Yu Qian <ji...@gmail.com> wrote:

> Yup, that's right, it becomes difficult if we want to support multiple
> languages in one spam detection solution. It's true that there are
> some best practices for a single language, but I didn't see much
> support for multiple languages.

The only practical approach is to normalize everything into Unicode and
tokenize Unicode characters.  (We actually use UTF-8 as the on-disk
representation.)

We have a custom Bayes engine that treats any character in the CJK
Unified Ideographs range as a word.  This is not strictly correct
because there are two-character (and longer) CJK words, but it's close
enough, especially because our Bayes engine also looks at word pairs.
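
A minimal Python sketch of that general idea follows. It is an assumption
about how such a tokenizer could look, not the engine described in this
message: each character in the basic CJK Unified Ideographs block
(U+4E00..U+9FFF) becomes its own token, other word characters are grouped
into runs, and adjacent tokens are also emitted as pairs. The names
`TOKEN` and `tokens_with_pairs` are invented for the example, and the
extension blocks are ignored.

```python
import re

# One token per CJK Unified Ideograph, otherwise a run of word characters
# that lie outside that block.
TOKEN = re.compile(r"[\u4e00-\u9fff]|[^\W\u4e00-\u9fff]+")


def tokens_with_pairs(text):
    words = TOKEN.findall(text)
    # Adjacent pairs recover some context lost by per-character splitting.
    pairs = [a + " " + b for a, b in zip(words, words[1:])]
    return words + pairs


print(tokens_with_pairs("buy 免费 now"))
```

With pairs, a two-character word such as 免费 still shows up as the single
token "免 费" even though it was split into characters first.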

I think this is a Summer of Code project for SpamAssassin. :)

Regards,

Dianne.

Re: How does SpamAssassin process languages other than English

Posted by Yu Qian <ji...@gmail.com>.

Yup, that's right, it becomes difficult if we want to support multiple
languages in one spam detection solution. It's true that there are some
best practices for a single language, but I didn't see much support for
multiple languages.

---
Yu Qian
Ottawa Ontario
Phone: (514)-553-0198



On Tue, Apr 12, 2016 at 1:38 PM, Reindl Harald <h....@thelounge.net>
wrote:

> STAY ON LIST
>
> On 12.04.2016 at 19:22, Yu Qian wrote:
>
>> Yes, right. What I am interested in is that Chinese is different, so
>> does SpamAssassin have a strong tokenizer for it, or does it just use
>> the same tokenizer?
>>
>> ---
>> Yu Qian
>> Ottawa Ontario
>> Phone: (514)-553-0198
>>
>>
>>
>> On Tue, Apr 12, 2016 at 1:16 PM, Reindl Harald <h.reindl@thelounge.net>
>> wrote:
>>
>>
>>
>>     On 12.04.2016 at 18:44, Yu Qian wrote:
>>
>>         SpamAssassin uses Bayes as its classifier, which is typical and
>>         efficient for English. But how does it process other languages,
>>         such as Asian languages?
>>
>>         Can anyone explain that, or show the code where SpamAssassin
>>         does it?
>>
>>
>>     Bayes is by definition language-agnostic.
>>
>>     *You train* Bayes with samples of ham and spam (at least a few
>>     hundred of each); the tokenizer splits the messages into parts and
>>     builds a database of how often each word appears in spam and in ham
>>     (put simply).
>>
>>
>>
>>
>>
> --
>
> Reindl Harald
> the lounge interactive design GmbH
> A-1060 Vienna, Hofmühlgasse 17
> CTO / CISO / Software-Development
> m: +43 (676) 40 221 40, p: +43 (1) 595 3999 33
> icq: 154546673, http://www.thelounge.net/
>
> http://www.thelounge.net/signature.asc.what.htm
>
>

Re: How does SpamAssassin process languages other than English

Posted by Reindl Harald <h....@thelounge.net>.


On 12.04.2016 at 18:44, Yu Qian wrote:
> SpamAssassin uses Bayes as its classifier, which is typical and efficient
> for English. But how does it process other languages, such as Asian
> languages?
>
> Can anyone explain that, or show the code where SpamAssassin does it?

Bayes is by definition language-agnostic.

*You train* Bayes with samples of ham and spam (at least a few hundred
of each); the tokenizer splits the messages into parts and builds a
database of how often each word appears in spam and in ham (put simply).
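
The train-and-count idea can be sketched in a few lines of Python. This is
a toy model under simplifying assumptions (whitespace tokenizer, add-one
smoothing, per-token score only); SpamAssassin's real Bayes code is
considerably more elaborate, and the class name `TinyBayes` and its
methods are invented for the example.

```python
from collections import Counter


class TinyBayes:
    """Toy token-count model; names here are illustrative only."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.messages = {True: 0, False: 0}

    def train(self, text, is_spam):
        # Count each distinct token once per message.
        self.counts[is_spam].update(set(text.lower().split()))
        self.messages[is_spam] += 1

    def spamminess(self, token):
        # Add-one smoothed share of spam and ham messages with the token.
        spam = (self.counts[True][token] + 1) / (self.messages[True] + 2)
        ham = (self.counts[False][token] + 1) / (self.messages[False] + 2)
        return spam / (spam + ham)


model = TinyBayes()
model.train("free pills now", is_spam=True)
model.train("meeting notes now", is_spam=False)
print(model.spamminess("free"), model.spamminess("meeting"))
```

Nothing in the counting depends on the language; only the tokenizer does,
which is why whitespace splitting is the weak point for text without word
delimiters.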