Posted to users@spamassassin.apache.org by Rakesh <ra...@netcore.co.in> on 2004/07/17 21:56:52 UTC
Unusual Bayes behaviour
Hi,
I have set up a mail server with Postfix+MailScanner+SpamAssassin+Cyrus.
Everything went fine for the first 4 months, until recently, when Bayes
started showing some unusual behaviour.
Bayes is giving variable scores for the same kind of mail. Say I received
a mail for which BAYES_90 gave a score of 2.101; 3 hours later, when the
same kind of mail appeared again, it hit BAYES_60 with a score of 1.592,
and within the next hour the same mail got a negative value.
This has really started to make me sweat: many spam messages are
detected in one instance and missed in another. I suspected that my
sa-learn mechanism was behind the variable score, since Bayes might have
picked up more HAM tokens for these mails during feedback. But my
feedback mechanism does not read from any inbox folder. It works like
this: any mail sent to nospam@mydomain.com is fed to sa-learn by a Perl
wrapper script. When I checked my logs for possible HAM feedback during
that period, I did not find a single entry, which left me even more
puzzled.
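The feedback path described above can be sketched roughly as follows. This is a minimal illustration in Python, not the Perl wrapper from the post: the mapping of nospam@mydomain.com to `sa-learn --ham` is an assumption (the address suggests "not spam"), the spam@mydomain.com address is hypothetical, and the log line format is invented for the example. It relies only on sa-learn reading a single message from standard input.

```python
import subprocess

# Assumed mapping from feedback address to sa-learn mode.
# nospam@mydomain.com comes from the post; treating it as ham feedback
# is an assumption, and spam@mydomain.com is purely hypothetical.
FEEDBACK_MODES = {
    "nospam@mydomain.com": "--ham",
    "spam@mydomain.com": "--spam",
}


def build_sa_learn_command(recipient):
    """Return the sa-learn argv for a feedback recipient, or None if unknown."""
    mode = FEEDBACK_MODES.get(recipient.strip().lower())
    if mode is None:
        return None
    return ["sa-learn", mode]


def learn_message(recipient, raw_message, log):
    """Pipe one raw message (bytes) to sa-learn and record what was done."""
    cmd = build_sa_learn_command(recipient)
    if cmd is None:
        log.append(f"skipped recipient={recipient}")
        return False
    # sa-learn reads the message from stdin when no file is given.
    result = subprocess.run(cmd, input=raw_message, capture_output=True)
    log.append(
        f"learned mode={cmd[1]} recipient={recipient} rc={result.returncode}"
    )
    return result.returncode == 0
```

Logging the mode for every call, as this sketch does, makes it easy to confirm afterwards whether any ham feedback was actually learned in a given time window.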
My next suspect is Bayes DB expiry. I have read in much of the
documentation that we need to expire and rebuild the Bayes DB so that
old tokens do not eat up disk space. But since I had plenty of hard
drive space, I decided not to expire the database, and now it is 39 MB.
I feel that perhaps the database has grown too large to be parsed
effectively, but I am still confused. Is this the reason for the
abnormal behaviour of Bayes? Is there any reason other than disk space
why we need to expire and rebuild the database? Guys, please help me
out. At least let me know what could be causing this abnormal Bayes
scoring.
regards
Rakesh
Re: Unusual Bayes behaviour
Posted by Rakesh <ra...@netcore.co.in>.
Matt Kettler wrote:
> At 03:56 PM 7/17/2004, Rakesh wrote:
>
>> But my feedback mechanism does not read from any inbox folder. It
>> works like this: any mail sent to nospam@mydomain.com is fed to
>> sa-learn by a Perl wrapper script. When I checked my logs for
>> possible HAM feedback during that period, I did not find a single
>> entry, which left me even more puzzled.
>
>
> What about autolearning? Did you check for that? Recent versions of
> MailScanner will insert autolearn flags into the spam-hits header.
Yes, autolearn may be the primary factor in the Bayes mess-up.
Would it be OK to autolearn only SPAM mails and not HAM mails? I think
that may be wrong, though, since autolearning only SPAM messages could
give rise to a lot of false positives, as spammers have started using
words and phrases that make their mails look more HAMMY. I don't know
what to do; I think I am getting confused. Should I force-expire my
Bayes database and start building a new database afresh? Suggestions,
please.
>
>> My next suspect is Bayes DB expiry. I have read in much of the
>> documentation that we need to expire and rebuild the Bayes DB so that
>> old tokens do not eat up disk space. But since I had plenty of hard
>> drive space, I decided not to expire the database, and now it is
>> 39 MB.
>
>
> OUCH.. don't circumvent the expiry mechanism if you don't understand
> its full purpose. It's actually rather important because it weeds out
> garbage tokens.
>
> A bayes DB that never expires is *highly* vulnerable to bayes poisoning.
>
Well, I am sorry about my wrong approach to the expiry mechanism.
regards
Rakesh
Re: Unusual Bayes behaviour
Posted by Matt Kettler <mk...@evi-inc.com>.
At 03:56 PM 7/17/2004, Rakesh wrote:
>Bayes is giving variable scores for the same kind of mail. Say I received
>a mail for which BAYES_90 gave a score of 2.101; 3 hours later, when the
>same kind of mail appeared again, it hit BAYES_60 with a score of 1.592,
>and within the next hour the same mail got a negative value. This has
>really started to make me sweat: many spam messages are detected in one
>instance and missed in another. I suspected that my sa-learn mechanism
>was behind the variable score, since Bayes might have picked up more HAM
>tokens for these mails during feedback. But my feedback mechanism does
>not read from any inbox folder. It works like this: any mail sent to
>nospam@mydomain.com is fed to sa-learn by a Perl wrapper script. When I
>checked my logs for possible HAM feedback during that period, I did not
>find a single entry, which left me even more puzzled.
What about autolearning? Did you check for that? Recent versions of
MailScanner will insert autolearn flags into the spam-hits header.
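One way to check for autolearning is to pull the `autolearn=` field out of the spam report header of the affected messages. The sketch below is only an illustration: the sample header value is made up, and the exact header name and layout depend on the MailScanner and SpamAssassin versions in use, so the regular expression only assumes that the text contains `autolearn=<word>` somewhere.

```python
import re

# Matches the autolearn field wherever it appears in a header value.
AUTOLEARN_RE = re.compile(r"autolearn=([A-Za-z]+)")


def autolearn_flag(header_value):
    """Return the autolearn value (e.g. 'ham', 'spam', 'no') or None."""
    match = AUTOLEARN_RE.search(header_value)
    return match.group(1).lower() if match else None


# Made-up header value for illustration only.
sample = "No, hits=1.6 required=5.0 tests=BAYES_60 autolearn=ham version=2.63"
print(autolearn_flag(sample))
```

If messages of the same kind as the misclassified spam show `autolearn=ham`, that would explain ham tokens entering the database without any manual feedback.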
>My next suspect is Bayes DB expiry. I have read in much of the
>documentation that we need to expire and rebuild the Bayes DB so that old
>tokens do not eat up disk space. But since I had plenty of hard drive
>space, I decided not to expire the database, and now it is 39 MB.
OUCH.. don't circumvent the expiry mechanism if you don't understand its
full purpose. It's actually rather important because it weeds out garbage
tokens.
A bayes DB that never expires is *highly* vulnerable to bayes poisoning.
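To see how large the database has grown and whether expiry has ever run, the counters printed by `sa-learn --dump magic` can be inspected. The following is a rough sketch that parses that output; the sample text uses invented numbers, and the column layout assumed here (value in the third column, name after "non-token data:") may differ between SpamAssassin versions.

```python
def parse_dump_magic(text):
    """Extract the 'non-token data' counters from sa-learn --dump magic output."""
    stats = {}
    marker = "non-token data:"
    for line in text.splitlines():
        if marker not in line:
            continue
        fields, name = line.split(marker, 1)
        columns = fields.split()
        if len(columns) < 3:
            continue
        # Assumption: the counter value sits in the third column.
        stats[name.strip()] = int(columns[2])
    return stats


# Invented sample output for illustration only.
SAMPLE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0       5120          0  non-token data: nspam
0.000          0       2048          0  non-token data: nham
0.000          0     900000          0  non-token data: ntokens
0.000          0          0          0  non-token data: last expiry atime
"""

stats = parse_dump_magic(SAMPLE)
print(stats["ntokens"], stats["last expiry atime"])
```

A very large `ntokens` together with a `last expiry atime` of zero would be consistent with expiry never having run; `sa-learn --force-expire` requests an expiry pass by hand.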