Posted to users@spamassassin.apache.org by Jason Haar <Ja...@trimble.co.nz> on 2005/11/02 02:55:19 UTC

any extra language effort for SA? (esp. Asian SPAM)

Hi there

I just did a stat-run on email I received 31st Oct, and found that of
the mail SA scored lower than 5/5 (i.e. SA classified as "ham"), a large
amount was SPAM. In fact it only caught 80% of the SPAM I received that
day (this is with SA 3.1.0)

Of that I was able to tell that the vast majority of "missed" SPAM was
actually Asian SPAM - the Subject: lines alone were 100% non-ASCII - bit
of a give-away as I am ignorant and can't speak anything but
Kiwi-English ;-)
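
As a rough illustration of the "non-ASCII Subject:" observation above, here is a minimal Python sketch that decodes a Subject header and reports what share of its characters fall outside ASCII. The function name non_ascii_fraction and the sample subjects are made up for illustration; this is not how SpamAssassin itself scores mail.

```python
from email.header import decode_header, make_header


def non_ascii_fraction(raw_subject: str) -> float:
    """Return the share of non-ASCII characters in a decoded Subject header.

    Hypothetical helper for illustration only; not part of SpamAssassin.
    """
    # Turn RFC 2047 encoded-words (=?charset?B?...?=) into Unicode text.
    text = str(make_header(decode_header(raw_subject)))
    # Ignore whitespace so spaces between words do not dilute the ratio.
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if ord(c) > 127) / len(chars)


# Sample subjects (invented for this example).
japanese_subject = "=?UTF-8?B?44GT44KT44Gr44Gh44Gv?="
english_subject = "Hello world"

print(non_ascii_fraction(japanese_subject))
print(non_ascii_fraction(english_subject))
```

A ratio like this only says that a subject is not written in ASCII; it says nothing about whether the mail is spam, which is why it cannot simply be turned into a site-wide rule for a multilingual user base.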

If I removed that Asian SPAM from the figures, the effectiveness of SA
shot up to 98% - pretty darn good!
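
The arithmetic behind the 80% and 98% figures can be sketched as follows. The message counts are hypothetical (the post gives only percentages), and the sketch assumes that all of the Asian-language spam was missed, so removing it shrinks the total but not the number caught.

```python
def catch_rate(caught: int, total_spam: int) -> float:
    """Fraction of spam messages that were scored at or above the threshold."""
    if total_spam <= 0:
        raise ValueError("total_spam must be positive")
    return caught / total_spam


# Hypothetical counts chosen only to match the percentages in the post.
total_spam = 100      # all spam received that day (assumed)
caught = 80           # spam that SA scored as spam (assumed)
missed_asian = 18     # missed spam that was Asian-language (assumed)

overall = catch_rate(caught, total_spam)
excluding_asian = catch_rate(caught, total_spam - missed_asian)

print(f"overall catch rate: {overall:.0%}")
print(f"excluding Asian-language spam: {excluding_asian:.0%}")
```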

Now personally I can run SA on my workstation with "ok_locales en" and
bang extra points onto non-English mail - but I certainly can't do that
for our company as a whole - which has customers from every
country/nationality, etc.

So the only thing I can think of is that there appears to be a need for
more non-English rulesets to add points for different language usages of
viagra/porn/whatever.

Am I correct in my thinking, and if so is the SA group getting help from
non-English developers to make this happen? I see a couple of
"body_test" rules that appear to be for Spanish and Polish - but no others?

-- 
Cheers

Jason Haar
Information Security Manager, Trimble Navigation Ltd.
Phone: +64 3 9635 377 Fax: +64 3 9635 417
PGP Fingerprint: 7A2E 0407 C9A6 CAF6 2B9F 8422 C063 5EBB FE1D 66D1

Re: any extra language effort for SA? (esp. Asian SPAM)

Posted by Alan Premselaar <al...@12inch.com>.

Jason Haar wrote:
> Hi there
> 
> I just did a stat-run on email I received 31st Oct, and found that of
> the mail SA scored lower than 5/5 (i.e. SA classified as "ham"), a large
> amount was SPAM. In fact it only caught 80% of the SPAM I received that
> day (this is with SA 3.1.0)
> 
> Of that I was able to tell that the vast majority of "missed" SPAM was
> actually Asian SPAM - the Subject: lines alone were 100% non-ASCII - bit
> of a give-away as I am ignorant and can't speak anything but
> Kiwi-English ;-)
> 
> If I removed that Asian SPAM from the figures, the effectiveness of SA
> shot up to 98% - pretty darn good!
> 
> Now personally I can run SA on my workstation with "ok_locales en" and
> bang extra points onto non-English mail - but I certainly can't do that
> for our company as a whole - which has customers from every
> country/nationality, etc.
> 
> So the only thing I can think of is that there appears to be a need for
> more non-English rulesets to add points for different language usages of
> viagra/porn/whatever.
> 
> Am I correct in my thinking, and if so is the SA group getting help from
> non-English developers to make this happen? I see a couple of
> "body_test" rules that appear to be for Spanish and Polish - but no others?
> 

Jason,

  I know that I have personally contributed some rules to catch certain
phrases in Japanese; however, this seems like a really good scenario for
manual Bayes training.

While the auto-learning is convenient and often "good enough", I think
the general consensus is that you should do at least a certain bit of
manual training so that your Bayes databases better represent your mail
traffic patterns.
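
For what manual training might look like in practice, here is a minimal Python sketch that wraps the sa-learn command-line tool, which accepts --spam and --ham followed by a file or directory of messages. The helper names and the directory paths are hypothetical, and the sketch assumes sa-learn is installed and on the PATH of the user whose Bayes database should be trained.

```python
import subprocess
from typing import List


def sa_learn_command(kind: str, path: str) -> List[str]:
    """Build the sa-learn argument list for a folder of hand-sorted mail.

    kind is "spam" or "ham"; path is a file or directory of messages.
    Hypothetical helper for illustration only.
    """
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", f"--{kind}", path]


def train(kind: str, path: str) -> None:
    """Run sa-learn; raises CalledProcessError if it exits non-zero."""
    subprocess.run(sa_learn_command(kind, path), check=True)


# Example paths are assumptions; substitute your own hand-sorted folders.
print(sa_learn_command("spam", "/home/user/Maildir/.missed-spam/cur"))
print(sa_learn_command("ham", "/home/user/Maildir/.good-mail/cur"))
```

Feeding both the missed Asian-language spam and legitimate non-English mail through training like this gives the Bayes database examples from each side, rather than penalising a language as a whole.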

hope this helps,

alan