Posted to users@spamassassin.apache.org by Kim Christensen <ki...@01.se> on 2007/01/25 12:56:41 UTC

True spam getting really low Bayesian points

Hey list,

I've recently started training our bayesian filter with spam/ham from my
personal mailbox, to prepare for live usage on our customer accounts.

% sa-learn --dump magic
...
0.000          0        340          0  non-token data: nspam
0.000          0        475          0  non-token data: nham
0.000          0      53404          0  non-token data: ntokens
...
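
For reference, the database has been fed roughly like this (the mailbox
paths are just placeholders here):

% sa-learn --spam /path/to/spam-folder
% sa-learn --ham  /path/to/ham-folder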

So far so good, and spamd is actually using the bayesian db when
examining incoming mails. However, I find that a few of the legit ham
mails (not a majority) get unusually high bayesian points, while some
of the real spam (which SA does score as spam) often gets bayesian
points < 1.

Now, I'm sure I haven't trained the database with wrong messages. Is it
a good idea to continue feeding sa-learn with example spam and ham until
it reaches a few thousand messages, before relying on the results?

I would think my current amount is sufficient, but I guess something's
wrong with this picture :-)


Best regards
-- 
Kim Christensen
"You just had a near-life experience."

Re: True spam getting really low Bayesian points

Posted by Matt Kettler <mk...@verizon.net>.
Kim Christensen wrote:
> Hey list,
>
> I've recently started training our bayesian filter with spam/ham from my
> personal mailbox, to prepare for live usage on our customer accounts.
>
> % sa-learn --dump magic
> ...
> 0.000          0        340          0  non-token data: nspam
> 0.000          0        475          0  non-token data: nham
> 0.000          0      53404          0  non-token data: ntokens
> ...
>
> So far so good, and spamd is actually using the bayesian db when
> examining incoming mails. However, I find that a few of the legit ham
> mails (not a majority) get unusually high bayesian points, while some
> of the real spam (which SA does score as spam) often gets bayesian
> points < 1.
>
> Now, I'm sure I haven't trained the database with wrong messages. Is it
> a good idea to continue feeding sa-learn with example spam and ham until
> it reaches a few thousand messages, before relying on the results?
>
> I would think my current amount is sufficient, but I guess something's
> wrong with this picture :-)
>
>
>
>   
If you want to see which tokens are throwing bayes off, try
running a mis-categorized message through spamassassin -D bayes. This
turns on bayes debugging and prints all the bayes-matching
tokens in the message (in text form) along with their individual
probabilities.

It's completely normal for a message to have a few tokens on "the wrong
side", so don't over-worry about testing every message this way; that
can lead to the mistake of micro-managing your bayes database. However,
it can be useful for figuring out what bayes is thinking when you get
odd results.
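
For example, something along these lines dumps the per-token details
(the file name is just a placeholder; the -D output goes to stderr):

% spamassassin -D bayes < misfiled-message.eml > /dev/null 2> bayes-debug.txt
% grep 'bayes:' bayes-debug.txt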



Re: True spam getting really low Bayesian points

Posted by Kim Christensen <ki...@01.se>.
* maillist <ma...@emailacs.com> [2007-01-25 10:21:47 -0600]:

> Kim Christensen wrote:
> >Hey list,
> >
> >I've recently started training our bayesian filter with spam/ham from my
> >personal mailbox, to prepare for live usage on our customer accounts.
> >
> >% sa-learn --dump magic
> >...
> >0.000          0        340          0  non-token data: nspam
> >0.000          0        475          0  non-token data: nham
> >0.000          0      53404          0  non-token data: ntokens
> >...
> >
> >So far so good, and spamd is actually using the bayesian db when
> >examining incoming mails. However, I find that a few of the legit ham
> >mails (not a majority) get unusually high bayesian points, while some
> >of the real spam (which SA does score as spam) often gets bayesian
> >points < 1.
> >
> >Now, I'm sure I haven't trained the database with wrong messages. Is it
> >a good idea to continue feeding sa-learn with example spam and ham until
> >it reaches a few thousand messages, before relying on the results?
> >
> >I would think my current amount is sufficient, but I guess something's
> >wrong with this picture :-)
> >
> >
> >Best regards
> >  
> Run spamassassin --test-mode on the messages that are scoring high and
> low, and see whether they are actually hitting any BAYES_* tests. I'm
> not 100% sure, but I think that by default bayes doesn't even kick in
> until you have 500 trained messages each of spam and ham.
> 
> You can of course get around this by setting bayes_min_ham_num and
> bayes_min_spam_num in your local.cf file.

Yeah, an example spam message that SA scores at 17 points gets the
following result from a test scan:

...
 0.0 BAYES_50               BODY: Bayesian spam probability is 40 to 60%
                            [score: 0.5106]
...

Sure enough, it does run through the Bayesian filter, and all the other
rules are going wild about it - but not the BAYES_* test!
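
For the record, that result came from a plain test scan along these
lines (the file name is just a placeholder):

% spamassassin --test-mode < sample-spam.eml | tail -n 20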


Best regards
-- 
Kim Christensen
"I am Jack's smirking revenge."

Re: True spam getting really low Bayesian points

Posted by maillist <ma...@emailacs.com>.
maillist wrote:
> Kim Christensen wrote:
>> Hey list,
>>
>> I've recently started training our bayesian filter with spam/ham from my
>> personal mailbox, to prepare for live usage on our customer accounts.
>>
>> % sa-learn --dump magic
>> ...
>> 0.000          0        340          0  non-token data: nspam
>> 0.000          0        475          0  non-token data: nham
>> 0.000          0      53404          0  non-token data: ntokens
>> ...
>>
>> So far so good, and spamd is actually using the bayesian db when
>> examining incoming mails. However, I find that a few of the legit ham
>> mails (not a majority) get unusually high bayesian points, while some
>> of the real spam (which SA does score as spam) often gets bayesian
>> points < 1.
>> Now, I'm sure I haven't trained the database with wrong messages. Is it
>> a good idea to continue feeding sa-learn with example spam and ham until
>> it reaches a few thousand messages, before relying on the results?
>>
>> I would think my current amount is sufficient, but I guess something's
>> wrong with this picture :-)
>>
>>
>> Best regards
>>   
> Run spamassassin --test-mode on the messages that are scoring high and
> low, and see whether they are actually hitting any BAYES_* tests. I'm
> not 100% sure, but I think that by default bayes doesn't even kick in
> until you have 500 trained messages each of spam and ham.
>
> You can of course get around this by setting bayes_min_ham_num and
> bayes_min_spam_num in your local.cf file.
>
> -=Aubrey=-
>
The default for 3.* is 200 messages for each.  Sorry dude.

-=Aubrey=-

Re: True spam getting really low Bayesian points

Posted by maillist <ma...@emailacs.com>.
Kim Christensen wrote:
> Hey list,
>
> I've recently started training our bayesian filter with spam/ham from my
> personal mailbox, to prepare for live usage on our customer accounts.
>
> % sa-learn --dump magic
> ...
> 0.000          0        340          0  non-token data: nspam
> 0.000          0        475          0  non-token data: nham
> 0.000          0      53404          0  non-token data: ntokens
> ...
>
> So far so good, and spamd is actually using the bayesian db when
> examining incoming mails. However, I find that a few of the legit ham
> mails (not a majority) get unusually high bayesian points, while some
> of the real spam (which SA does score as spam) often gets bayesian
> points < 1.
>
> Now, I'm sure I haven't trained the database with wrong messages. Is it
> a good idea to continue feeding sa-learn with example spam and ham until
> it reaches a few thousand messages, before relying on the results?
>
> I would think my current amount is sufficient, but I guess something's
> wrong with this picture :-)
>
>
> Best regards
>   
Run spamassassin --test-mode on the messages that are scoring high and
low, and see whether they are actually hitting any BAYES_* tests. I'm
not 100% sure, but I think that by default bayes doesn't even kick in
until you have 500 trained messages each of spam and ham.

You can of course get around this by setting bayes_min_ham_num and
bayes_min_spam_num in your local.cf file.
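
A minimal local.cf sketch along those lines (the numbers are just an
example, not a recommendation):

# local.cf -- let bayes start scoring after fewer learned messages
bayes_min_ham_num  100
bayes_min_spam_num 100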

-=Aubrey=-