Posted to users@spamassassin.apache.org by Rakesh <ra...@netcore.co.in> on 2004/07/17 21:56:52 UTC

Unusual Bayes behaviour

Hi,

I have set up a mail server with Postfix + MailScanner + SpamAssassin + 
Cyrus. Everything went fine for the first 4 months, until recently, when 
Bayes started showing some unusual behaviour.

Bayes is giving variable scores for the same kind of mail. Say I received 
a mail for which BAYES_90 fired with a score of 2.101. Three hours later, 
when the same kind of mail appeared again, it hit BAYES_60 with a score 
of 1.592, and within the next hour the same mail got a negative value. 
This has really started to make me sweat: many spam messages are 
detected in one instance and left undetected in another. I suspected 
that my sa-learn mechanism was behind this variable score, since Bayes 
might have picked up more HAM tokens for these mails during feedback. 
But my feedback mechanism does not read from any inbox folder. It works 
like this: any mail sent to nospam@mydomain.com is fed to sa-learn by a 
Perl wrapper script. When I checked my logs for possible HAM feedback 
during that period, I didn't find a single entry, which left me even 
more puzzled.

My next suspect is Bayes DB expiry. I have read in a lot of 
documentation that we need to expire and rebuild the Bayes DB so that 
old tokens do not eat up disk space. But since I had plenty of hard 
drive space, I decided not to expire the database, and now it is 39 MB 
in size. I suspect that the database has perhaps grown too large to be 
parsed effectively, but I am still confused. Is this the reason for the 
abnormal Bayes behaviour? Is there any reason other than disk space why 
we need to expire and rebuild the database? Guys, please help me out. At 
least let me know what could be causing this abnormal Bayes scoring.

regards
Rakesh


Re: Unusual Bayes behaviour

Posted by Rakesh <ra...@netcore.co.in>.
Matt Kettler wrote:

> At 03:56 PM 7/17/2004, Rakesh wrote:
>
>> But my feedback mechanism does not read from any inbox folder. It 
>> works like this: any mail sent to nospam@mydomain.com is fed to 
>> sa-learn by a Perl wrapper script. When I checked my logs for 
>> possible HAM feedback during that period, I didn't find a single 
>> entry, which left me even more puzzled.
>
>
> What about autolearning? Did you check for that? Recent versions of 
> MailScanner will insert autolearn flags into the spam-hits header.

Yes, autolearn may be the primary factor behind the Bayes mess-up. 
Would it be OK if I autolearn only SPAM mails and not HAM mails? But I 
think I may be wrong about this, since autolearning only SPAM messages 
would give rise to a lot of false positives, as spammers have started 
using words and phrases that make their mails look more hammy. I don't 
know what to do; I think I am getting confused. Should I force-expire my 
Bayes database and start building a new database afresh? Suggestions, 
please?

>
>> My next suspect is Bayes DB expiry. I have read in a lot of 
>> documentation that we need to expire and rebuild the Bayes DB so 
>> that old tokens do not eat up disk space. But since I had plenty of 
>> hard drive space, I decided not to expire the database, and now it 
>> is 39 MB in size.
>
>
> OUCH.. don't circumvent the expiry mechanism if you don't understand 
> its full purpose. It's actually rather important because it weeds out 
> garbage tokens.
>
> A bayes DB that never expires is *highly* vulnerable to bayes poisoning.
>
Well, I am sorry about my wrong approach to the expiry mechanism.

regards
Rakesh


Re: Unusual Bayes behaviour

Posted by Matt Kettler <mk...@evi-inc.com>.
At 03:56 PM 7/17/2004, Rakesh wrote:
>Bayes is giving variable scores for the same kind of mail. Say I received 
>a mail for which BAYES_90 fired with a score of 2.101. Three hours later, 
>when the same kind of mail appeared again, it hit BAYES_60 with a score of 
>1.592, and within the next hour the same mail got a negative value. This 
>has really started to make me sweat: many spam messages are detected in 
>one instance and left undetected in another. I suspected that my sa-learn 
>mechanism was behind this variable score, since Bayes might have picked 
>up more HAM tokens for these mails during feedback. But my feedback 
>mechanism does not read from any inbox folder. It works like this: any 
>mail sent to nospam@mydomain.com is fed to sa-learn by a Perl wrapper 
>script. When I checked my logs for possible HAM feedback during that 
>period, I didn't find a single entry, which left me even more puzzled.

What about autolearning? Did you check for that? Recent versions of 
MailScanner will insert autolearn flags into the spam-hits header.
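
One way to check this in bulk is to pull the autolearn flag and the Bayes 
rule out of each saved spam-report header. The following is only a rough 
sketch: the header layout (a comma-separated list containing 
"autolearn=..." and a rule name such as BAYES_90) and the sample values 
are assumptions for illustration, so adjust the patterns to whatever your 
own headers actually contain.

```python
import re

# Assumed layout: "... autolearn=<flag>, BAYES_<nn> <score>, ..."
AUTOLEARN_RE = re.compile(r"autolearn=([^,)]+)")
BAYES_RE = re.compile(r"\bBAYES_(\d{2})\b")


def summarize_header(header_value):
    """Return (autolearn_flag, bayes_bucket); each is None if not found."""
    autolearn = AUTOLEARN_RE.search(header_value)
    bayes = BAYES_RE.search(header_value)
    return (
        autolearn.group(1).strip() if autolearn else None,
        int(bayes.group(1)) if bayes else None,
    )


# Hypothetical header values, made up for this example.
samples = [
    "spam, SpamAssassin (score=7.2, required 5, autolearn=spam, BAYES_90 2.101)",
    "not spam, SpamAssassin (score=1.6, required 5, autolearn=not spam, BAYES_60 1.592)",
]
for value in samples:
    print(summarize_header(value))
```

If messages of the same kind show up with a ham-style autolearn flag 
shortly before the Bayes bucket drops, autolearning is a likely suspect.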

>My next suspect is Bayes DB expiry. I have read in a lot of documentation 
>that we need to expire and rebuild the Bayes DB so that old tokens do not 
>eat up disk space. But since I had plenty of hard drive space, I decided 
>not to expire the database, and now it is 39 MB in size.

OUCH.. don't circumvent the expiry mechanism if you don't understand its 
full purpose. It's actually rather important because it weeds out garbage 
tokens.

A bayes DB that never expires is *highly* vulnerable to bayes poisoning.
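
The "garbage tokens" point can be illustrated with a toy model. This is 
not SpamAssassin's actual expiry algorithm, just a minimal sketch of the 
idea that tokens which have not been seen for a long time get dropped. 
The token names, counts, record layout and the 30-day cutoff are all 
made up for the example.

```python
from datetime import datetime, timedelta


def expire_tokens(tokens, now, max_age):
    """Return a copy of `tokens` without entries last seen more than `max_age` ago.

    `tokens` maps a token string to a dict with hypothetical keys
    "spam" and "ham" (counts) and "last_seen" (a datetime).
    """
    return {
        token: record
        for token, record in tokens.items()
        if now - record["last_seen"] <= max_age
    }


now = datetime(2004, 7, 17)
tokens = {
    "viagra": {"spam": 40, "ham": 0, "last_seen": now - timedelta(days=2)},
    # A random one-off string of the kind spammers pad their mails with.
    "xq7zkw": {"spam": 1, "ham": 0, "last_seen": now - timedelta(days=120)},
    "meeting": {"spam": 0, "ham": 25, "last_seen": now - timedelta(days=5)},
}

kept = expire_tokens(tokens, now, max_age=timedelta(days=30))
print(sorted(kept))  # prints ['meeting', 'viagra']
```

Without a step like this, one-off tokens only ever accumulate. On a real 
installation the expiry itself is done by SpamAssassin (for example via 
sa-learn --force-expire), not by a script like this one.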