Posted to users@spamassassin.apache.org by David Kentwood <da...@gmail.com> on 2012/07/24 00:39:21 UTC

spamassassin bayesian training on foreign characters

Hello,

I get a lot of foreign-language spam (e.g. Chinese, Russian, etc.) and am
thinking of training SpamAssassin to identify it. My questions are:

1) Can a stock install of SpamAssassin recognize foreign characters without
special configuration?

2) How well does Bayesian training work on foreign-language spam?

Thanks for any advice on this matter.

Dave

Re: spamassassin bayesian training on foreign characters

Posted by RW <rw...@googlemail.com>.
On Tue, 24 Jul 2012 08:36:53 -0400
David F. Skoll wrote:

> On Tue, 24 Jul 2012 09:41:19 +0200
> Simon Loewenthal <si...@klunky.co.uk> wrote:
> 
> >    I have Bayes correctly scoring BAYES_99 on Dutch and French
> > straight out of the box. No problems.

Dutch, French, etc. are very similar to English, with most of their
characters being plain ASCII.
 
> It does work, but with a caveat: SpamAssassin does not normalize the
> character set.  So if you train it on Chinese in the GB2312 character
> set, that will do nothing for you if you receive UTF-8 Chinese spam.
> Furthermore, if some random character set A and another random
> character set B share byte sequences, your Bayes training may confuse
> them.
> 
> Also, I don't believe SpamAssassin has any type of logic for
> recognizing word boundaries in ideographic character sets vs.
> alphabetic ones.

There's also a problem with non-Roman alphabets represented with
multibyte characters: the maximum token length (15) is hit on
relatively short words. There is some attempt to work around this by
converting such tokens into byte pairs.
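
To put numbers on the token-length point, here is a minimal Python sketch.
It is not SpamAssassin's tokenizer; it assumes the limit of 15 is counted
in bytes, and the sample word and the byte_pairs() helper are made up for
illustration only.

```python
# Sketch only: this is NOT SpamAssassin's tokenizer, just the arithmetic
# behind the token-length limit mentioned above.
MAX_TOKEN_LENGTH = 15  # limit quoted above; assumed here to be counted in bytes


def byte_length(word, charset="utf-8"):
    """Number of bytes the word occupies in the given charset."""
    return len(word.encode(charset))


def byte_pairs(raw):
    """Split a byte string into consecutive 2-byte chunks (hypothetical helper)."""
    return [raw[i:i + 2] for i in range(0, len(raw), 2)]


word = "распродажа"  # sample Russian word, 10 letters
raw = word.encode("utf-8")

print(len(word), "letters,", len(raw), "bytes in UTF-8")
print("over the limit:", len(raw) > MAX_TOKEN_LENGTH)
print("byte pairs:", byte_pairs(raw))
```

A ten-letter Cyrillic word already exceeds a 15-byte limit in UTF-8, while
the same word in a single-byte charset such as KOI8-R would not.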


> Bayes is pretty robust, so it "works" in the face of a lot of noise,
> but SA's implementation still leaves quite a bit to be desired.

In most spam aimed at English speakers, spammers avoid leaving any
useful tokens in the text, and Bayes still works from the headers and
mark-up.


Re: spamassassin bayesian training on foreign characters

Posted by "David F. Skoll" <df...@roaringpenguin.com>.
On Tue, 24 Jul 2012 09:41:19 +0200
Simon Loewenthal <si...@klunky.co.uk> wrote:

>    I have Bayes correctly scoring BAYES_99 on Dutch and French
> straight out of the box. No problems.

It does work, but with a caveat: SpamAssassin does not normalize the
character set.  So if you train it on Chinese in the GB2312 character
set, that will do nothing for you if you receive UTF-8 Chinese spam.
Furthermore, if some random character set A and another random
character set B share byte sequences, your Bayes training may confuse
them.
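
A small Python sketch of the character-set point, for anyone who wants to
see it at the byte level. The sample text and the raw_bytes() helper are
assumptions made for this illustration; nothing here is SpamAssassin code.

```python
# Sketch only: shows why byte-level tokens learned from one charset do not
# match the same text in another charset. Not SpamAssassin code.

def raw_bytes(text, charset):
    """Encode text the way it would travel in a message body (hypothetical helper)."""
    return text.encode(charset)


text = "你好"  # sample Chinese text chosen for this sketch

gb = raw_bytes(text, "gb2312")
utf8 = raw_bytes(text, "utf-8")

# Same characters, completely different byte sequences.
print("GB2312:", gb.hex())
print("UTF-8: ", utf8.hex())
print("identical bytes:", gb == utf8)

# The same GB2312 bytes are also valid text in an unrelated charset,
# which is how training on one charset can bleed into another.
print("GB2312 bytes read as Latin-1:", gb.decode("latin-1"))
```

A filter that tokenizes raw bytes sees the GB2312 and UTF-8 versions as
unrelated, so training on one says nothing about the other.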

Also, I don't believe SpamAssassin has any type of logic for recognizing
word boundaries in ideographic character sets vs. alphabetic ones.
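
To illustrate the word-boundary point, here is a minimal Python sketch
using a plain whitespace split. That split is a stand-in assumed for this
example, not SpamAssassin's actual tokenizer, and both sample strings are
made up.

```python
# Sketch only: a naive whitespace tokenizer, used as a stand-in for any
# tokenizer that relies on spaces to find word boundaries.

def whitespace_tokens(text):
    """Split on runs of whitespace (hypothetical helper)."""
    return text.split()


english = "cheap watches for sale"
chinese = "出售便宜的手表"  # sample sentence written without spaces

print(len(whitespace_tokens(english)), "tokens:", whitespace_tokens(english))
print(len(whitespace_tokens(chinese)), "token: ", whitespace_tokens(chinese))
```

The English line splits into separate words, while the whole Chinese
sentence comes back as a single token because it contains no whitespace.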

Bayes is pretty robust, so it "works" in the face of a lot of noise, but
SA's implementation still leaves quite a bit to be desired.

Regards,

David.

Re: spamassassin bayesian training on foreign characters

Posted by David Kentwood <da...@gmail.com>.
Thanks for the replies. It's good to have some confirmation!

Re: spamassassin bayesian training on foreign characters

Posted by Simon Loewenthal <si...@klunky.co.uk>.
Hi

I have Bayes correctly scoring BAYES_99 on Dutch and French straight out of
the box. No problems.

John Hardin <jh...@impsec.org> wrote:

>On Mon, 23 Jul 2012, David Kentwood wrote:
>
>> Hello,
>>
>> I get a lot of foreign-language spam (e.g. Chinese, Russian, etc.) and am
>> thinking of training SpamAssassin to identify it. My questions are:
>>
>> 1) Can a stock install of SpamAssassin recognize foreign characters without
>> special configuration?
>
>Yes.
>
>> 2) How well does Bayesian training work on foreign-language spam?
>
>Quite well here. I have trained it on Chinese, Portuguese and Spanish, and
>it always hits BAYES_99 on such mail.
>
>> Thanks for any advice on this matter.
>
>There shouldn't be anything special about the language w/r/t Bayes.


Re: spamassassin bayesian training on foreign characters

Posted by John Hardin <jh...@impsec.org>.
On Mon, 23 Jul 2012, David Kentwood wrote:

> Hello,
>
> I get a lot of foreign-language spam (e.g. Chinese, Russian, etc.) and am
> thinking of training SpamAssassin to identify it. My questions are:
>
> 1) Can a stock install of SpamAssassin recognize foreign characters without
> special configuration?

Yes.

> 2) How well does Bayesian training work on foreign-language spam?

Quite well here. I have trained it on Chinese, Portuguese and Spanish, and
it always hits BAYES_99 on such mail.

> Thanks for any advice on this matter.

There shouldn't be anything special about the language w/r/t Bayes.
