Posted to users@spamassassin.apache.org by Paul Boven <p....@chello.nl> on 2005/05/06 14:28:06 UTC

The trouble with Bayes

Hi everyone,

Here are some observations on using Bayes and autolearning I would like 
to share, and have your input on.

Autolearning is turning out to be more trouble than it's worth. 
Although it helps the system to get to know the ham we send and get, and 
learn some of the spams on its own, it also tends to 'reward' the 'best' 
spammers out there. Spams that hit none of the rules (e.g. the current 
deluge of stock-spams) drive the scores for all kinds of misspelled words 
towards the 'hammy' side of the curve, which makes it possible for more 
of that kind of junk to slip through even if it hits SURBLs or other rules.

The second weakness in the current Bayes setup concerns the 
're-training' of the filter. The assumption in Bayes is that if a mail 
gets submitted for training, it will first be 'forgotten' and then 
correctly learned as spam (or ham). But in order to 'forget', 
SpamAssassin must be able to recognise that the submitted message is the 
same as a previously autolearned one. Currently this is done by checking 
the MsgID or some checksum of the headers. There are two potential 
pitfalls here: Firstly, the retraining message is never exactly the same 
as the original message. It's made another hop to the mailstore, or has 
been mangled by Exchange or some user agent. Secondly, especially if the 
original Msg-ID was not used by the autolearner, the SA-Generated Msg-ID 
would not be the same as the original. As soon as that happens, 
retraining becomes far less powerful: when the original faulty 
autolearning doesn't get 'forgotten', the retraining will mostly cancel 
it out, but never get a chance to correct the Bayes scores for those tokens.
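
(To be explicit about what I mean by the forget/relearn cycle: on the 
filter host it boils down to something like the two commands below. The 
path is just a placeholder, and this assumes the saved copy still matches 
what bayes_seen recorded, which is exactly the assumption that breaks.)

  # drop whatever was (auto)learned for this message, then learn it as spam
  sa-learn --forget /tmp/misfiled-message.eml
  sa-learn --spam   /tmp/misfiled-message.eml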

The end-users at my site are fairly good at submitting their spams to 
the filter (and fairly vocal if the filter misses too much). But there 
are also accounts that are not being read by humans. Accounts that gate 
onto mailing-lists. All these get spam too, and the spam gets 
autolearned, sometimes in the wrong direction. With retraining only 
partially effective as shown above, what happens in the end is that some 
spams, by virtue of sheer volume and sameness, manage to bias the filter 
in the wrong direction. Surely I'm not the only one who experiences 
this, because 'My Bayes has gone bad' is a frequent subject in this forum.

Some suggestions on improving the performance of the Bayes system:

1.) Messages that have been manually submitted should have a higher 
'weight' in the Bayes statistics than autolearned messages.

2.) There should be a framework within SpamAssassin that makes it easy 
for end-users to submit their spam for training. Currently, there are 
all kinds of scripts available outside the main SpamAssassin 
distribution (I've written my own, too) that attempt to get the message 
out of the mail-client or server and as close as possible to the 
original, to feed back to Bayes. Which is close to impossible with some 
of the mail-servers out there. SpamAssassin currently only includes half 
the Bayes interface: you can have auto-learning, but for manual learning 
or retraining you're on your own to some extent.

3.) Message classification should not be on something as fragile as a 
mail-header or checksum thereof, but on the actual content. The goal of 
this classifier should be to be able to identify a message as being 
learned before, despite what has happened to it after having gone through 
SpamAssassin.

4.) The Bayes subsystem should store this classification, and all the 
tokens it learned. This way we can be sure that we correctly unlearn an 
autolearned message. The entries in this database could be timestamped 
so they can be removed after some months, to prevent unlimited growth.
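
To make suggestion 4 a bit more concrete, here is a purely hypothetical 
sketch; this is not how the current Bayes code stores anything, and the 
module choice, path and digest are only for illustration:

  # Hypothetical: remember which tokens each learned message touched, keyed
  # on a content digest, so a later correction can undo exactly that update.
  use strict;
  use warnings;
  use DB_File;
  use Digest::MD5 qw(md5_hex);

  tie my %learned, 'DB_File', '/var/lib/spamassassin/learned_messages.db';

  sub record_learning {
      my ($body, $class, @tokens) = @_;
      # store class, timestamp and token list under a digest of the body text
      $learned{ md5_hex($body) } = join("\0", $class, time(), @tokens);
  }

  sub unlearn_exactly {
      my ($body) = @_;
      my $entry = delete $learned{ md5_hex($body) } or return;
      my ($class, $when, @tokens) = split /\0/, $entry;
      # ...decrement exactly these tokens for $class in the Bayes db here...
  }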

Bayes is a very powerful system, especially for recognising 
site-specific ham. But at this moment, approx. 30% of the spam that slips 
through my filter has 'autolearn=ham' set. And another 60% of the spam 
slipping through has a negative Bayes score to help them along. For the 
moment, I've disabled the autolearning in my Bayes system.
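
For the record, turning it off is a one-line change in local.cf using the 
standard option:

  # stop Bayes from training itself; manual sa-learn training still works
  bayes_auto_learn 0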

Regards, Paul Boven.


Re: The trouble with Bayes

Posted by Jim Maul <jm...@elih.org>.
Paul Boven wrote:
> Hi Jim,
> 
> Jim Maul wrote:
> 
>> Paul Boven wrote:
> 
> 
>>> Bayes is a very powerful system, especially for recognising 
>>> site-specific ham. But at this moment, approx. 30% of the spam that 
>>> slips through my filter has 'autolearn=ham' set. And another 60% of 
>>> the spam slipping through has a negative Bayes score to help them 
>>> along. For the moment, I've disabled the autolearning in my Bayes 
>>> system.
> 
> 
>> If your system is autolearning 30% of the spam as ham it is seriously 
>> screwed up.
> 
> 
> No, fortunately that's not the case. Of all the spam that slips through 
> (which is still just below 1%), about a third not only manages to slip 
> through, but even gets autolearned the wrong way.
> 

Ok so not seriously screwed up, only mildly screwed up ;)


>> It only autolearns when it's pretty damn sure of its classification of 
>> the message in question.  A bad bayes database will only continue to 
>> get worse if left alone.  The trick is starting out good with the 
>> learning and it's cake from there.  On some systems it's even less of an 
>> issue.  I've maybe manually sa-learn'ed 20-30 messages ever in a 
>> little over a year using SA.  Everything else has been autolearned.  
>> It's rare that I see bayes scores other than _00 and _99. I'd say my 
>> bayes db is pretty damn accurate at this point, and it's done most of 
>> it on its own.  Now keep in mind that I've altered the scores of some 
>> rules (bayes mostly) and I've also adjusted the autolearn thresholds 
>> for my system.  I've upped the spam and lowered the ham numbers so 
>> nothing will be autolearned unless SA is REALLY sure it knows what it's 
>> doing.  I'd tend to think it's easier to tweak the system a bit than to 
>> change the way bayes/autolearning works... but hey, that's just me.
> 
> 
> Thanks for your response. What thresholds have you set for autolearning, 
> and how exactly do you do your retraining? How many users does your 
> SpamAssassin setup have?
> 

bayes_auto_learn_threshold_nonspam -0.1
bayes_auto_learn_threshold_spam 10.0

Note that the -0.1 for the ham threshold will cause almost no messages 
to be autolearned unless you are running a lot of negative scoring 
rules.  I have some, but not a lot.

How exactly do I do the retraining?  I have my mail account set to leave 
the messages on the server.  Whenever I have a message which needs 
training (I haven't had a missed spam in months, so the only things that 
need training are spams that were tagged but not autolearned because my 
autolearn threshold is set to 10 and my tagging happens at 5), I just ssh 
into the server and run sa-learn on the message in my mailbox.
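
Concretely that is nothing fancier than pointing sa-learn at the stored 
copy; the paths below are only examples, and --mbox is only needed for 
mbox-format folders:

  # learn one stored message as spam
  sa-learn --spam ~/Maildir/cur/1115382000.12345.mail.example:2,S

  # or learn a whole mbox-format folder of collected spam
  sa-learn --spam --mbox ~/mail/caught-spam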

We have about 100 users and about 2k messages/day.

> Over here, the auto-learning thresholds are still at their default 
> values (though I've disabled auto-learning for now), re-training is done 
> by sending the offending message back to the filter in a message/rfc822 
> attachment, and there are about 90 users on the system. My Bayes 
> database is in fairly good shape, but some kinds of spam have managed to 
> get themselves a negative score.

I realize that my setup is smaller than most, so it's easier for me to 
keep an eye on the system to see what is autolearned.  Autolearning 
errors need to be corrected immediately or things start to snowball in a 
bad way.

So basically, I think the best practice (at least for my situation) is to 
leave autolearning on, but adjust the thresholds so things don't get 
learned in either direction unless SA is absolutely sure (really low/really 
high scores).  Then, everything that gets an autolearn=no can be 
manually trained in the correct direction.  As always, bayes can work 
wonders, but it needs a little hand-holding... don't give up on it just yet ;)

-Jim

Re: The trouble with Bayes

Posted by Paul Boven <p....@chello.nl>.
Hi Jim,

Jim Maul wrote:
> Paul Boven wrote:

>> Bayes is a very powerful system, especially for recognising 
>> site-specific ham. But at this moment, approx. 30% of the spam that slips 
>> through my filter has 'autolearn=ham' set. And another 60% of the spam 
>> slipping through has a negative Bayes score to help them along. For the 
>> moment, I've disabled the autolearning in my Bayes system.

> If your system is autolearning 30% of the spam as ham it is seriously 
> screwed up.

No, fortunately that's not the case. Of all the spam that slips through 
(which is still just below 1%), about a third not only manages to slip 
through, but even gets autolearned the wrong way.

> It only autolearns when it's pretty damn sure of its 
> classification of the message in question.  A bad bayes database will 
> only continue to get worse if left alone.  The trick is starting out 
> good with the learning and it's cake from there.  On some systems it's 
> even less of an issue.  I've maybe manually sa-learn'ed 20-30 messages 
> ever in a little over a year using SA.  Everything else has been 
> autolearned.  It's rare that I see bayes scores other than _00 and _99. 
> I'd say my bayes db is pretty damn accurate at this point, and it's done 
> most of it on its own.  Now keep in mind that I've altered the scores of 
> some rules (bayes mostly) and I've also adjusted the autolearn 
> thresholds for my system.  I've upped the spam and lowered the ham 
> numbers so nothing will be autolearned unless SA is REALLY sure it knows 
> what it's doing.  I'd tend to think it's easier to tweak the system a bit 
> than to change the way bayes/autolearning works... but hey, that's just me.

Thanks for your response. What thresholds have you set for autolearning, 
and how exactly do you do your retraining? How many users does your 
SpamAssassin setup have?

Over here, the auto-learning thresholds are still at their default 
values (though I've disabled auto-learning for now), re-training is done 
by sending the offending message back to the filter in a message/rfc822 
attachment, and there are about 90 users on the system. My Bayes 
database is in fairly good shape, but some kinds of spam have managed to 
get themselves a negative score.

Regards, Paul Boven.

Re: The trouble with Bayes

Posted by Jim Maul <jm...@elih.org>.
Paul Boven wrote:
> Hi everyone,
> 
> Here are some observations on using Bayes and autolearning I would like 
> to share, and have your input on.
> 
> Autolearning is turning out to be more trouble than it's worth. 
> Although it helps the system to get to know the ham we send and get, and 
> learn some of the spams on its own, it also tends to 'reward' the 'best' 
> spammers out there. Spams that hit none of the rules (e.g. the current 
> deluge of stock-spams) drive the scores for all kinds of misspelled words 
> towards the 'hammy' side of the curve, which makes it possible for more 
> of that kind of junk to slip through even if it hits SURBLs or other rules.
> 
> 

<SNIP>

> 
> Bayes is a very powerful system, especially for recognising 
> site-specific ham. But at this moment, approx. 30% of the spam that slips 
> through my filter has 'autolearn=ham' set. And another 60% of the spam 
> slipping through has a negative Bayes score to help them along. For the 
> moment, I've disabled the autolearning in my Bayes system.
> 
> Regards, Paul Boven.
> 
> 

If your system is autolearning 30% of the spam as ham it is seriously 
screwed up.  It only autolearns when it's pretty damn sure of its 
classification of the message in question.  A bad bayes database will 
only continue to get worse if left alone.  The trick is starting out 
good with the learning and it's cake from there.  On some systems it's 
even less of an issue.  I've maybe manually sa-learn'ed 20-30 messages 
ever in a little over a year using SA.  Everything else has been 
autolearned.  It's rare that I see bayes scores other than _00 and _99. 
I'd say my bayes db is pretty damn accurate at this point, and it's done 
most of it on its own.  Now keep in mind that I've altered the scores of 
some rules (bayes mostly) and I've also adjusted the autolearn 
thresholds for my system.  I've upped the spam and lowered the ham 
numbers so nothing will be autolearned unless SA is REALLY sure it knows 
what it's doing.  I'd tend to think it's easier to tweak the system a bit 
than to change the way bayes/autolearning works... but hey, that's just me.

-Jim

Re: The trouble with Bayes

Posted by James R <ja...@trusswood.dyndns.org>.
Paul Boven wrote:
> Hi everyone,
> 
> Here are some observations on using Bayes and autolearning I would like 
> to share, and have your input on.
> 
> Autolearning is turning out to be more trouble than it's worth. 
> Although it helps the system to get to know the ham we send and get, and 
> learn some of the spams on its own, it also tends to 'reward' the 'best' 
> spammers out there. Spams that hit none of the rules (e.g. the current 
> deluge of stock-spams) drive the scores for all kinds of misspelled words 
> towards the 'hammy' side of the curve, which makes it possible for more 
> of that kind of junk to slip through even if it hits SURBLs or other rules.
> 
> The second weakness in the current Bayes setup concerns the 
> 're-training' of the filter. The assumption in Bayes is that if a mail 
> gets submitted for training, it will first be 'forgotten' and then 
> correctly learned as spam (or ham). But in order to 'forget', 
> SpamAssassin must be able to recognise that the submitted message is the 
> same as a previously autolearned one. Currently this is done by checking 
> the MsgID or some checksum of the headers. There are two potential 
> pitfalls here: Firstly, the retraining message is never exactly the same 
> as the original message. It's made another hop to the mailstore, or has 
> been mangled by Exchange or some user agent. Secondly, especially if the 
> original Msg-ID was not used by the autolearner, the SA-Generated Msg-ID 
> would not be the same as the original. As soon as that happens, 
> retraining becomes far less powerful: when the original faulty 
> autolearning doesn't get 'forgotten', the retraining will mostly cancel 
> it out, but never get a chance to correct the Bayes scores for those 
> tokens.
> 
> The end-users at my site are fairly good at submitting their spams to 
> the filter (and fairly vocal if the filter misses too much). But there 
> are also accounts that are not being read by humans. Accounts that gate 
> onto mailing-lists. All these get spam too, and the spam gets 
> autolearned, sometimes in the wrong direction. With retraining only 
> partially effective as shown above, what happens in the end is that some 
> spams, by virtue of sheer volume and sameness, manage to bias the filter 
> in the wrong direction. Surely I'm not the only one who experiences 
> this, because 'My Bayes has gone bad' is a frequent subject in this forum.
> 
> Some suggestions on improving the performance of the Bayes system:
> 
> 1.) Messages that have been manually submitted should have a higher 
> 'weight' in the Bayes statistics than autolearned messages.
> 
> 2.) There should be a framework within SpamAssassin that makes it easy 
> for end-users to submit their spam for training. Currently, there are 
> all kinds of scripts available outside the main SpamAssassin 
> distribution (I've written my own, too) that attempt to get the message 
> out of the mail-client or server and as close as possible to the 
> original, to feed back to Bayes. Which is close to impossible with some 
> of the mail-servers out there. SpamAssassin currently only includes half 
> the Bayes interface: you can have auto-learning, but for manual learning 
> or retraining you're on your own to some extent.
> 
> 3.) Message classification should not be on something as fragile as a 
> mail-header or checksum thereof, but on the actual content. The goal of 
> this classifier should be to be able to identify a message as being 
> learned before, despite what has happened to it after having gone through 
> SpamAssassin.
> 
> 4.) The Bayes subsystem should store this classification, and all the 
> tokens it learned. This way we can be sure that we correctly unlearn an 
> autolearned message. The entries in this database could be timestamped 
> so they can be removed after some months, to prevent unlimited growth.
> 
> Bayes is a very powerful system, especially for recognising 
> site-specific ham. But at this moment, approx. 30% of the spam that slips 
> through my filter has 'autolearn=ham' set. And another 60% of the spam 
> slipping through has a negative Bayes score to help them along. For the 
> moment, I've disabled the autolearning in my Bayes system.
> 
> Regards, Paul Boven.
> 
> 
> 

Several of the reasons you've mentioned are why I don't do 
autolearning. Manual training and user feedback are, imho, the best way 
to get the Bayes db up to spam-fighting levels. It may be more trouble 
for ISPs who handle a wider mix of mail, but here we have a pretty 
standard set of mails; by that I mean that mail to many of our users 
sounds about the same. I can grab a few dozen mails out of our archives, 
send them to my server's 'ham' box and let the cron job train those.
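
To sketch that cron job (the schedule and paths below are placeholders, 
not my real setup, and --mbox assumes mbox-format drop boxes):

  # nightly Bayes training from the submission boxes
  30 2 * * *  sa-learn --ham  --mbox /var/mail/ham-training
  40 2 * * *  sa-learn --spam --mbox /var/mail/spam-training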

As far as a standard interface goes, there is no standard mail 
server/OS/environment. This is generally something the admin or a 3rd 
party would need to draft up. I have scripts that were created from 
3rd-party parts, munged, and grafted into my own. We have a standard 
mail client here, and a standard way for users to submit mails to the 
global spambox. My scripts remove any markup from that transmission 
(these are not forwarded, but redirected) and drop the mail files into a 
spam folder, where I look at the mails to make sure they are of spam 
quality; the last step is to move them to where the Linux server picks 
up the mails and does the training. You can see there are a lot of 
steps; however, this ensures a user doesn't accidentally train the 
wrong mail.

From my own testing, SA does create a hash of the mail for the msg-id. 
We were thinking that if a spammer created a message with the exact 
same message-id every time, that could bypass any training on future 
near-identical spam messages, because the learner would ignore the mail. 
There are flaws in every system, but the ones in the present one, imo, 
are not so bad as to make it unusable. Bayes in SA is very good, it does 
a good job, and is easy to set up and train -- but by the same token, 
it's easy to hose your db with incorrect training (as you've seen). 
That's why I've got so many steps in the training process. But then 
again, I use RBLs etc. at the get-go in the SMTP conversation, which 
blocks the vast majority of spams, so I don't have to use BW or store 
any spams that made it past the initial SMTP conversation.

-- 
Thanks,
James

Re: The trouble with Bayes

Posted by Mike Grice <mg...@plus.net>.
On Fri, 2005-05-06 at 14:28 +0200, Paul Boven wrote:
> Hi everyone,
> 
> Here are some observations on using Bayes and autolearning I would like 
> to share, and have your input on.
> 
> Autolearning is turning out to be more trouble than it's worth. 
> Although it helps the system to get to know the ham we send and get, and 
> learn some of the spams on its own, it also tends to 'reward' the 'best' 
> spammers out there. Spams that hit none of the rules (e.g. the current 
> deluge of stock-spams) drive the scores for all kinds of misspelled words 
> towards the 'hammy' side of the curve, which makes it possible for more 
> of that kind of junk to slip through even if it hits SURBLs or other rules.
> 
> The second weakness in the current Bayes setup concerns the 
> 're-training' of the filter. The assumption in Bayes is that if a mail 
> gets submitted for training, it will first be 'forgotten' and then 
> correctly learned as spam (or ham). But in order to 'forget', 
> SpamAssassin must be able to recognise that the submitted message is the 
> same as a previously autolearned one. Currently this is done by checking 
> the MsgID or some checksum of the headers. There are two potential 
> pitfalls here: Firstly, the retraining message is never exactly the same 
> as the original message. It's made another hop to the mailstore, or has 
> been mangled by Exchange or some user agent. Secondly, especially if the 
> original Msg-ID was not used by the autolearner, the SA-Generated Msg-ID 
> would not be the same as the original. As soon as that happens, 
> retraining becomes far less powerful: when the original faulty 
> autolearning doesn't get 'forgotten', the retraining will mostly cancel 
> it out, but never get a chance to correct the Bayes scores for those tokens.

DSPAM gets around this by assigning each message a DSPAM-ID, which is
kept, at your choice, in the body of the mail, attached to the mail, or
in the headers.  It then keeps a record of every DSPAM-ID and looks for
it in the mail when it's sent back for training.

I have problems with this method because it clobbers any database on a
sufficiently high-volume site (as do Bayes and AWL in general).  There
must be some other way to do it, but doing multiple writes to a database
for every mail passing through a system is a real resource glutton (and
so I have to have them disabled).

Users have problems with the above method because they don't like extra
stuff in their message (if the DSPAM-ID is at the bottom of every mail,
or attached), and if you put it in the headers a user cannot forward it
(because you don't get the headers in all cases).

Cheers
Mike

-- 
| Mike Grice                  Broadband Solutions for
| Systems Engineer                  Home & Business @
| PlusNet plc.                           www.plus.net
+ ----- PlusNet - The smarter way to broadband ------


Re: The trouble with Bayes

Posted by Paul Boven <p....@chello.nl>.
Hi Kevin, everyone,

Kevin Peuhkurinen wrote:
> Paul Boven wrote:

>> but my goal is to find a way of doing this that is 
>> independent of the rest of the mail-system, and can then become an 
>> integral part of SA.
> 
> Any suggestions on how to do this?  One of SA's strengths is that it is 
> designed to be a module that can be plugged into a larger mail flow 
> environment rather than acting as a monolithic application.   I think 
> that any attempt to create a manual training method that suits every 
> environment is doomed to failure.

Well, the reason I bring this up is the hope that we can come up with 
such a thing. I'm not convinced at this moment that it is already doomed 
to failure: if I were, I would not have started this whole discussion.

I currently have a bit of perl that strips message/rfc822 attachments 
and feeds them to the learner, which works with a number of clients and 
servers. It's run from the alias-file, and has the advantage that 
end-users don't need an account on the filter-machine. The disadvantage 
is that it is susceptible to whatever changes those clients and servers 
inflict on the mail.
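
The core of it is roughly the following (a simplified sketch rather than 
the actual script; the sa-learn invocation and the assumption that 
everything submitted is spam are just for illustration):

  #!/usr/bin/perl
  # Sketch: pull message/rfc822 attachments out of a report that arrives
  # on stdin (e.g. via an /etc/aliases pipe) and feed each one to sa-learn.
  use strict;
  use warnings;
  use MIME::Parser;

  my $parser = MIME::Parser->new;
  $parser->output_to_core(1);            # keep parts in memory, no temp files
  $parser->extract_nested_messages(0);   # leave message/rfc822 bodies untouched

  my $entity = $parser->parse(\*STDIN);

  for my $part ($entity->parts_DFS) {
      next unless lc($part->effective_type) eq 'message/rfc822';
      next unless $part->bodyhandle;
      open(my $learn, '|-', 'sa-learn', '--spam', '--single')
          or die "cannot run sa-learn: $!";
      print {$learn} $part->bodyhandle->as_string;   # the original message
      close $learn;
  }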

> You've just pointed out yourself why it is next to impossible for SA to 
> associate a unique identification code to a specific email such that it 
> will always be able to recognize that email in the future.   SA has no 
> control nor knowledge of what happens to the email after it has scanned 
> it, and it can be altered in numerous ways.   So again, do you have any 
> better suggestions?   As I said in my original email, any attempt to 
> create a better means of identifying emails would need to be rigorously 
> tested before it could be shown to actually be better than what SA 
> already has.

Yes, you are right, this is not an easy puzzle. But it seems to me that 
given the way mail clients and servers behave, looking at the content of 
the mail could be more robust than the current method. And we don't need 
to be able to uniquely identify every email forever, but just to make 
sure that any (auto)learning can be undone within a reasonable time.

> It is a fine thing to demonstrate weaknesses in a product.   It's a much 
> better thing to suggest ways to improve it.  Your suggestion to weigh 
> manually learned emails heavier than autolearned ones is a good start.

Oh, please don't get me wrong. I'm not just complaining about what's 
been bugging me for a while, I'd like to help get this fixed. Discussing 
things seems like a good first step in that direction, with some of the 
best spam-fighters hanging out in this forum.

Regards, Paul Boven.


Re: The trouble with Bayes

Posted by Kevin Peuhkurinen <ke...@meridiancu.ca>.
Paul Boven wrote:

> Hi Kevin, everyone,
> 
> I agree that this would be difficult, but right now we're all facing 
> that difficulty on our own, so to speak. A more comprehensive Wiki would 
> help, but my goal is to find a way of doing this that is independent of 
> the rest of the mail-system, and can then become an integral part of SA.

Any suggestions on how to do this?  One of SA's strengths is that it is 
designed to be a module that can be plugged into a larger mail flow 
environment rather than acting as a monolithic application.   I think 
that any attempt to create a manual training method that suits every 
environment is doomed to failure.


> The current system works well if your mailbox is on the system where you 
> run SpamAssassin and you can retrain from the commandline. That's only a 
> small subset of all email-users though. Once the setup gets a bit more 
> complicated, involves IMAP servers, forwarding etc., you get in trouble.

You've just pointed out yourself why it is next to impossible for SA to 
associate a unique identification code to a specific email such that it 
will always be able to recognize that email in the future.   SA has no 
control nor knowledge of what happens to the email after it has scanned 
it, and it can be altered in numerous ways.   So again, do you have any 
better suggestions?   As I said in my original email, any attempt to 
create a better means of identifying emails would need to be rigorously 
tested before it could be shown to actually be better than what SA 
already has.

It is a fine thing to demonstrate weaknesses in a product.   It's a much 
better thing to suggest ways to improve it.  Your suggestion to weigh 
manually learned emails heavier than autolearned ones is a good start.

Re: The trouble with Bayes

Posted by Paul Boven <p....@chello.nl>.
Hi Kevin, everyone,

Kevin Peuhkurinen wrote:

>> 2.) There should be a framework within SpamAssassin that makes it easy 
>> for end-users to submit their spam for training. Currently, there are 
>> all kinds of scripts available outside the main SpamAssassin 
>> distribution (I've written my own, too) that attempt to get the 
>> message out of the mail-client or server and as close as possible to 
>> the original, to feed back to Bayes. Which is close to impossible with 
>> some of the mail-servers out there. SpamAssassin currently only 
>> includes half the Bayes interface: you can have auto-learning, but for 
>> manual learning or retraining you're on your own to some extent.
> 
> This I have to disagree with you on.   SA is used on too many different 
> types of systems in too many different environments for it to make any 
> sort of sense to try to concoct a one-size-fits-all solution to 
> learning.  A better approach would be a one-stop source of information 
> on how to implement learning in various environments, perhaps here: 
> http://wiki.apache.org/spamassassin/BayesInSpamAssassin

I agree that this would be difficult, but right now we're all facing 
that difficulty on our own, so to speak. A more comprehensive Wiki would 
help, but my goal is to find a way of doing this that is independent of 
the rest of the mail-system, and can then become an integral part of SA.

> I agree that basing the classification on message IDs is "fragile", but 
> I'm not sure that any other approach would be better.   Perhaps an MD5 
> sum of the contents not including headers or attachments?  It would 
> require a fair bit of testing of various methods in various real-world 
> environments before you could authoritatively say that one method is 
> clearly superior to the one currently used.

The current system works well if your mailbox is on the system where you 
run SpamAssassin and you can retrain from the commandline. That's only a 
small subset of all email-users though. Once the setup gets a bit more 
complicated, involves IMAP servers, forwarding etc., you get in trouble.

>> 4.) The Bayes subsystem should store this classification, and all the 
>> tokens it learned. This way we can be sure that we correctly unlearn an 
>> autolearned message. The entries in this database could be timestamped 
>> so they can be removed after some months, to prevent unlimited growth.

> Sounds like a good idea.  However, my Bayes database is already about 
> 60MB.  A significantly larger database may be a problem for some systems 
> with limited storage space.

Fortunately, this would not increase the Bayes token database itself, 
only the bayes_seen database, which is only accessed during 
(auto)learning, not during classification.

Regards, Paul Boven.

Re: The trouble with Bayes

Posted by Kevin Peuhkurinen <ke...@meridiancu.ca>.
Paul Boven wrote:

> Hi everyone,
> 
> Here are some observations on using Bayes and autolearning I would like 
> to share, and have your input on.

Okay!

> 
> Some suggestions on improving the performance of the Bayes system:
> 
> 1.) Messages that have been manually submitted should have a higher 
> 'weight' in the Bayes statistics than autolearned messages.

I agree with you there.  It seems to make good sense.

> 
> 2.) There should be a framework within SpamAssassin that makes it easy 
> for end-users to submit their spam for training. Currently, there are 
> all kinds of scripts available outside the main SpamAssassin 
> distribution (I've written my own, too) that attempt to get the message 
> out of the mail-client or server and as close as possible to the 
> original, to feed back to Bayes. Which is close to impossible with some 
> of the mail-servers out there. SpamAssassin currently only includes half 
> the Bayes interface: you can have auto-learning, but for manual learning 
> or retraining you're on your own to some extent.

This I have to disagree with you on.   SA is used on too many different 
types of systems in too many different environments for it to make any 
sort of sense to try to concoct a one-size-fits-all solution to 
learning.  A better approach would be a one-stop source of information 
on how to implement learning in various environments, perhaps here: 
http://wiki.apache.org/spamassassin/BayesInSpamAssassin

> 
> 3.) Message classification should not be on something as fragile as a 
> mail-header or checksum thereof, but on the actual content. The goal of 
> this classifier should be to be able to identify a message as being 
> learned before, despite what has happened to it after having gone through 
> SpamAssassin.

I agree that basing the classification on message IDs is "fragile", but 
I'm not sure that any other approach would be better.   Perhaps an MD5 
sum of the contents not including headers or attachments?  It would 
require a fair bit of testing of various methods in various real-world 
environments before you could authoritatively say that one method is 
clearly superior to the one currently used.
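
Just to make the idea concrete, something along these lines (a toy 
sketch, untested; the normalization choices are arbitrary and are exactly 
the part that would need that real-world testing):

  # Toy sketch of a content-based message fingerprint: ignore headers,
  # normalize whitespace and case, then take an MD5 of what is left.
  use Digest::MD5 qw(md5_hex);

  sub content_fingerprint {
      my ($raw_message) = @_;
      my (undef, $body) = split /\r?\n\r?\n/, $raw_message, 2;   # drop headers
      $body = '' unless defined $body;
      $body =~ s/\s+/ /g;          # collapse whitespace mangled in transit
      return md5_hex(lc $body);    # case-insensitive digest of the body
  }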

> 
> 4.) The Bayes subsystem should store this classification, and all the 
> tokens it learned. This way we can be sure that we correctly unlearn an 
> autolearned message. The entries in this database could be timestamped 
> so they can be removed after some months, to prevent unlimited growth.
>

Sounds like a good idea.  However, my Bayes database is already about 
60MB.  A significantly larger database may be a problem for some systems 
with limited storage space.

> Bayes is a very powerful system, especially for recognising 
> site-specific ham. But at this moment, approx. 30% of the spam that slips 
> through my filter has 'autolearn=ham' set. And another 60% of the spam 
> slipping through has a negative Bayes score to help them along. For the 
> moment, I've disabled the autolearning in my Bayes system.

I'm not sure that my experiences are similar.  I don't think that many 
of my false negatives are doing better than BAYES_50, but I'll take a
closer look.