Posted to users@spamassassin.apache.org by "Joshua, C.S. Chen" <cs...@asiaa.sinica.edu.tw> on 2006/07/13 09:17:05 UTC

BAYES_99 makes lots of false-positive

Hello folks,
My users speak Chinese. I have found that SpamAssassin does not seem to work
well with Chinese charsets (utf8 or big5) where Bayes is concerned. Many
(almost all) normal mails get a BAYES_99 score, although the real spam also
gets BAYES_99. It looks like mail in a foreign language such as Chinese very
easily gets a high Bayes score.
I have set ok_locales to all, but it doesn't help with the false-positive problem.

And another question: what should I expect if I run sa-learn --dump? Am I
supposed to see the phrases that SA has learned, such as key phrases or words
from the spam mails? If so, can I see some Chinese phrases?


Cheers
Joshua


Re: BAYES_99 makes lots of false-positive

Posted by "Joshua, C.S. Chen" <cs...@asiaa.sinica.edu.tw>.
Matt Kettler wrote:

>
>In sa 2.6x or older, yes.. in sa 3.0.0 or higher, no.
>
>First, phrases isn't quite accurate.. bayes stores tokens, and most of
>the tokens are simply words, not phrases.
>
>In SA 3.0.0 or higher the text tokens themselves are not stored, only
>the SHA1 hash of them is stored. This cannot be easily reversed to
>figure out what the text token was, but it's easy to figure out the hash
>of another token and compare the two. Thus, it's impossible for dump to
>display the text tokens, it doesn't know what they are.
>
>The main reason to do this in SA 3.x is performance. All the SHA hashes
>are the same size. No more variable-length string compares, just
>straight fixed-width binary compares. Ditto for record reads. A side
>effect is increased security.. nobody can look at your bayes DB and make
>assumptions about what your email conversations talk about.
>
>  
>


Thanks Matt, for the details.


>If you want to see the text tokens that match bayes for a particular
>message, you can do this by feeding a message to spamassassin in bayes
>debug mode..
>
>spamassassin -D bayes=255 <message.txt
>
>>>some key phrases, words
>>>in the spam mails? If so, can I see some chinese phrases?
>>
>>I've never tried, but the above should work for Chinese text, provided
>>your local terminal supports it.
>
>That should let you know which tokens in the message are matching bayes,
>and what probability each gets (from 0.0000 to 1.0000, which represents
>0% to 100%).
>
>Word of advice: if you see a LOT of innocuous words matching in the
>range of 0.90-1.0 you can worry. But do not worry about every single
>word that seems "wrong". A typical message will match a dozen or more
>tokens.
>
>All that said, how do you fix it? Feed your problem messages to sa-learn
>--ham. If it's really bad, wipe your bayes DB and start over.
>
>  
>


It sounds great to be able to see which tokens match those in the bayes db.
I tried a test message with -D bayes=255, like this:




$ spamassassin -D bayes=255 < /tmp/message
>From cschen@asiaa.sinica.edu.tw Fri Jul 14 10:32:01 2006
Return-Path: <cs...@asiaa.sinica.edu.tw>
X-Spam-Checker-Version: SpamAssassin 3.1.0 (2005-09-13) on
asiaa.sinica.edu.tw
X-Spam-Level:
X-Spam-Status: No, score=-102.2 required=6.0 tests=ALL_TRUSTED,AWL,
FROM_IAA_LOCAL_SITE1,USER_IN_WHITELIST autolearn=no version=3.1.0
Received: from [140.109.177.202] (genesis.asiaa.sinica.edu.tw
[140.109.177.202])
by asiaa.sinica.edu.tw (8.13.1/8.13.1) with ESMTP id k6E2VqVw011774
for <cs...@asiaa.sinica.edu.tw>; Fri, 14 Jul 2006 10:31:52 +0800
Message-ID: <44...@asiaa.sinica.edu.tw>
Date: Fri, 14 Jul 2006 10:31:52 +0800
From: "Joshua, C.S. Chen" <cs...@asiaa.sinica.edu.tw>
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.13)
Gecko/20060418 Red Hat/1.7.13-1.4.1
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: =?Big5?B?rEyswA==?= <cs...@asiaa.sinica.edu.tw>
Subject: test for spamassassin -D bayes=255
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Virus-Scanned: by amavisd-new
X-Keywords:
X-UID: 9719
Status: O
Content-Length: 88
Lines: 4

This is a test. How I want to see the tokens' details that bayes thinks.

Cheers
Joshua







It just showed the original message, not the tokens and probabilities.
Am I missing something here?


Thanks very much

Cheers
Joshua

Re: BAYES_99 makes lots of false-positive

Posted by Matt Kettler <mk...@comcast.net>.
Joshua, C.S. Chen wrote:
> Hello folks,
> My users speak Chinese. I found that spamassassin seems not working well
> about chinese chset (utf8 or big5) on the bayes issue. Many normal mails
> (almost) get BAYES_99 score although the real spam also get BAYES_99. It
> looks like foreign language like Chinese is very easy to be high bayes
> scored.
> I have setup ok_locales all but it doesn't help the false-positive problem.
>
> And another question: just wonder what if I do sa-learn --dump? Am I
> supposed to see the phrase that SA has learned? 
In sa 2.6x or older, yes.. in sa 3.0.0 or higher, no.

First, phrases isn't quite accurate.. bayes stores tokens, and most of
the tokens are simply words, not phrases.

In SA 3.0.0 or higher the text tokens themselves are not stored, only
the SHA1 hash of them is stored. This cannot be easily reversed to
figure out what the text token was, but it's easy to figure out the hash
of another token and compare the two. Thus, it's impossible for dump to
display the text tokens, because it doesn't know what they are.

The main reason to do this in SA 3.x is performance. All the SHA hashes
are the same size. No more variable-length string compares, just
straight fixed-width binary compares. Ditto for record reads. A side
effect is increased security.. nobody can look at your bayes DB and make
assumptions about what your email conversations talk about.
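
As a rough illustration of the hashing point above, here is a minimal Python sketch. It is not SpamAssassin code: the `token_key` helper and the sample tokens are made up for this example, and it assumes a full SHA1 digest as the key, which may differ from SA's actual on-disk format. It shows that every token, short or long, English or Chinese, maps to a key of the same width, and that a known token can be checked against the stored keys even though the keys do not reveal the text.

```python
import hashlib

def token_key(token: str) -> bytes:
    """Hypothetical helper: fixed-width key (SHA1 digest of the UTF-8 bytes)."""
    return hashlib.sha1(token.encode("utf-8")).digest()

# Pretend database: only the hashed keys are kept, never the token text.
stored = {token_key(t) for t in ["viagra", "meeting", "垃圾郵件"]}

# Every key has the same width, whatever the token length or language.
widths = {len(k) for k in stored}

# A candidate token is checked by hashing it and comparing the keys.
known = token_key("meeting") in stored
unknown = token_key("lunch") in stored

print(widths, known, unknown)
```

Going the other way, from a stored key back to its text, would require guessing candidate tokens and hashing each one, which is why a dump cannot list the words.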

If you want to see the text tokens that match bayes for a particular
message, you can do this by feeding a message to spamassassin in bayes
debug mode..

spamassassin -D bayes=255 <message.txt

That should let you know which tokens in the message are matching bayes,
and what probability each gets (from 0.0000 to 1.0000, which represents
0% to 100%).

Word of advice: if you see a LOT of innocuous words matching in the
range of 0.90-1.0 you can worry. But do not worry about every single
word that seems "wrong". A typical message will match a dozen or more
tokens.
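
To make that rule of thumb concrete, here is a small Python sketch. The token list and probabilities are invented for illustration and are not real SpamAssassin output, and `high_scoring` is a hypothetical helper. It simply picks out the tokens scoring 0.90 or above so you can judge whether many of them are innocuous words.

```python
def high_scoring(tokens, threshold=0.90):
    """Return the (token, probability) pairs at or above the threshold."""
    return [(t, p) for t, p in tokens if p >= threshold]

# Made-up values for illustration only; not real debug output.
sample = [
    ("meeting", 0.958),
    ("the", 0.912),
    ("agenda", 0.430),
    ("viagra", 0.999),
    ("thanks", 0.120),
]

suspicious = high_scoring(sample)
ratio = len(suspicious) / len(sample)

for token, prob in suspicious:
    print(f"{prob:.4f}  {token}")
print(f"{ratio:.0%} of matched tokens score >= 0.90")
```

In this invented sample, ordinary words such as "meeting" and "the" sit in the high range, which is the pattern to worry about; one odd token on its own is not.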

All that said, how do you fix it? Feed your problem messages to sa-learn
--ham. If it's really bad, wipe your bayes DB and start over.



> some key phrases, words
> in the spam mails? If so, can I see some chinese phrases?
>   
I've never tried, but the above should work for Chinese text, provided
your local terminal supports it.


Re: BAYES_99 makes lots of false-positive

Posted by Johann Spies <js...@sun.ac.za>.
On Thu, Jul 13, 2006 at 03:17:05PM +0800, Joshua, C.S. Chen wrote:
> Hello folks,
> My users speak Chinese. I found that spamassassin seems not working well
> about chinese chset (utf8 or big5) on the bayes issue. Many normal mails
> (almost) get BAYES_99 score although the real spam also get BAYES_99. It
> looks like foreign language like Chinese is very easy to be high bayes
> scored.
> I have setup ok_locales all but it doesn't help the false-positive problem.
> 
> And another question: just wonder what if I do sa-learn --dump? Am I
> supposed to see the phrase that SA has learned? some key phrases, words
> in the spam mails? If so, can I see some chinese phrases?

Do you regularly "feed" the spam filter Chinese emails, both ham and
spam?  That would probably be the best way to improve the accuracy
of the Bayesian filter.

Regards
Johann
-- 
Johann Spies          Telefoon: 021-808 4036
Informasietegnologie, Universiteit van Stellenbosch

     "Let your character be free from the love of money,
      being content with what you have; for He Himself has
      said, "I will never desert you, nor will I ever
      forsake you."
                              Hebrews 13:5