Posted to users@spamassassin.apache.org by Paul Boven <p....@chello.nl> on 2005/05/06 14:28:06 UTC
The trouble with Bayes
Hi everyone,
Here are some observations on using Bayes and autolearning that I would
like to share and get your input on.
Autolearning is turning out to be more trouble than it's worth.
Although it helps the system get to know the ham we send and receive, and
learn some of the spams on its own, it also tends to 'reward' the 'best'
spammers out there. Spams that hit none of the rules (e.g. the current
deluge of stock spams) drive the scores for all kinds of misspelled words
towards the 'hammy' side of the curve, which makes it possible for more
of that kind of junk to slip through even if it hits SURBLs or other rules.
The second weakness in the current Bayes setup concerns the
're-training' of the filter. The assumption in Bayes is that if a mail
gets submitted for training, it will first be 'forgotten' and then
correctly learned as spam (or ham). But in order to 'forget',
SpamAssassin must be able to recognise that the submitted message is the
same as a previously autolearned one. Currently this is done by checking
the Message-ID or a checksum of the headers. There are two potential
pitfalls here: Firstly, the message resubmitted for retraining is never
exactly the same as the original. It has made another hop to the
mailstore, or has been mangled by Exchange or some user agent. Secondly,
especially if the original Message-ID was not used by the autolearner,
the SA-generated Message-ID will not be the same as the original. As
soon as that happens, retraining becomes far less powerful: when the
original faulty autolearning doesn't get 'forgotten', the retraining will
mostly cancel it out, but never gets a chance to correct the Bayes scores
for those tokens.
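To make the fragility concrete, here is a small Python sketch (the real SpamAssassin code is Perl, and exactly which headers it hashes is an internal detail; this only illustrates the principle that any header-based checksum breaks as soon as one more hop adds a header):

```python
import hashlib

def header_checksum(raw_message: str) -> str:
    """Checksum over the header block -- a stand-in for SA's msg-id fallback."""
    headers = raw_message.split("\n\n", 1)[0]
    return hashlib.sha1(headers.encode()).hexdigest()

original = "From: a@example.com\nSubject: test\n\nbody text\n"
# One extra hop prepends a Received: header; the checksum no longer matches,
# so the earlier (auto)learning can no longer be 'forgotten'.
relayed = "Received: from mx1 by mailstore; Fri, 6 May 2005\n" + original

print(header_checksum(original) == header_checksum(relayed))  # False
```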
The end-users at my site are fairly good at submitting their spams to
the filter (and fairly vocal if the filter misses too much). But there
are also accounts that are not read by humans, such as accounts that gate
onto mailing lists. These get spam too, and that spam gets autolearned,
sometimes in the wrong direction. With retraining only partially
effective, as shown above, what happens in the end is that some spams,
by virtue of sheer volume and sameness, manage to bias the filter in the
wrong direction. Surely I'm not the only one who experiences this,
because 'My Bayes has gone bad' is a frequent subject in this forum.
Some suggestions on improving the performance of the Bayes system:
1.) Messages that have been manually submitted should have a higher
'weight' in the Bayes statistics than autolearned messages.
2.) There should be a framework within SpamAssassin that makes it easy
for end-users to submit their spam for training. Currently, there are
all kinds of scripts available outside the main SpamAssassin
distribution (I've written my own, too) that attempt to get the message
out of the mail client or server, as close as possible to the original,
to feed back to Bayes -- which is close to impossible with some of the
mail servers out there. SpamAssassin currently only includes half the
Bayes interface: you can have auto-learning, but for manual learning
or retraining you're on your own to some extent.
3.) Message classification should not be based on something as fragile
as a mail header or a checksum thereof, but on the actual content. The
goal of this classifier should be to identify a message as having been
learned before, despite whatever has happened to it after it went
through SpamAssassin.
4.) The Bayes subsystem should store this classification, and all the
tokens it learned. This way we can be sure that we correctly unlearn an
autolearned message. The entries in this database could be timestamped
so they can be removed after some months, to prevent unlimited growth.
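As a rough illustration of suggestion 4, here is a hypothetical side table (the names and structure are my own, not SA internals) that records exactly which tokens each learned message contributed, so an autolearn can be reversed verbatim even if the resubmitted copy has been mangled:

```python
import hashlib

class LearnLog:
    """Hypothetical side table: fingerprint -> exactly which tokens were learned."""

    def __init__(self):
        self.entries = {}

    def learn(self, body, counts):
        """Count tokens into `counts` and remember what this message added."""
        tokens = body.lower().split()
        for t in tokens:
            counts[t] = counts.get(t, 0) + 1
        fp = hashlib.md5(body.encode()).hexdigest()
        self.entries[fp] = tokens
        return fp

    def unlearn(self, fp, counts):
        """Reverse a previous learn exactly, using the stored token list."""
        for t in self.entries.pop(fp):
            counts[t] -= 1

counts = {}
log = LearnLog()
fp = log.learn("buy cheap stocks now", counts)
log.unlearn(fp, counts)
print(all(v == 0 for v in counts.values()))  # True: the autolearn is fully undone
```

Expiring entries after some months, as suggested, would just mean storing a timestamp alongside each token list and sweeping old fingerprints periodically.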
Bayes is a very powerful system, especially for recognising
site-specific ham. But at this moment, approx. 30% of the spam that
slips through my filter has 'autolearn=ham' set. And another 60% of the
spam slipping through has a negative Bayes score to help it along. For
the moment, I've disabled the autolearning in my Bayes system.
Regards, Paul Boven.
Re: The trouble with Bayes
Posted by Jim Maul <jm...@elih.org>.
Paul Boven wrote:
> Hi Jim,
>
> Jim Maul wrote:
>
>> Paul Boven wrote:
>
>
>>> Bayes is a very powerful system, especially for recognising
>>> site-specific ham. But at this moment, approx. 30% of the spam that
>>> slips through my filter has 'autolearn=ham' set. And another 60% of
>>> the spam slipping through has a negative Bayes score to help it
>>> along. For the moment, I've disabled the autolearning in my Bayes
>>> system.
>
>
>> If your system is autolearning 30% of the spam as ham it is seriously
>> screwed up.
>
>
> No, fortunately that's not the case. Of all the spam that slips through
> (which is still just below 1%), about a third not only manages to
> slip through, but even gets autolearned the wrong way.
>
Ok so not seriously screwed up, only mildly screwed up ;)
>> It only autolearns when its pretty damn sure of its classification of
>> the message in question. A bad bayes database will only continue to
>> get worse if left alone. The trick is starting out good with the
>> learning and its cake from there. On some systems its even less of an
>> issue. I've maybe manually sa-learn'ed 20-30 messages ever in a
>> little over a year using SA. Everything else has been autolearned.
>> Its rare that i see bayes scores other than _00 and _99. I'd say my
>> bayes db is pretty damn accurate at this point, and its done most of
>> it on its own. Now keep in mind that i've altered the scores of some
>> rules (bayes mostly) and i've also adjusted the autolearn thresholds
>> for my system. I've upped the spam and lowered the ham numbers so
>> nothing will be autolearned unless SA is REALLY sure it knows what its
>> doing. I'd tend to think its easier to tweak the system a bit than to
>> change the way bayes/autolearning works..but hey, thats just me.
>
>
> Thanks for your response. What thresholds have you set for autolearning,
> and how exactly do you do your retraining? How many users does your
> SpamAssassin setup have?
>
bayes_auto_learn_threshold_nonspam -0.1
bayes_auto_learn_threshold_spam 10.0
Note that the -0.1 for the ham threshold will cause almost no messages
to be autolearned unless you are running a lot of negative scoring
rules. I have some, but not a lot.
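For reference, those settings go in local.cf along these lines (option names as in SA 3.x; the shipped defaults are more permissive than these):

```
# local.cf -- conservative autolearn thresholds, as described above
bayes_auto_learn                    1
bayes_auto_learn_threshold_nonspam  -0.1
bayes_auto_learn_threshold_spam     10.0
```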
How exactly do I do the retraining? I have my mail account set to leave
the messages on the server. Whenever I have a message which needs
training (I haven't had a missed spam in months, so the only things that
need training are spams that were tagged but not autolearned because my
threshold is set to 10 and my tagging happens at 5), I just ssh into the
server and run sa-learn on the message in my mailbox.
We have about 100 users and about 2k messages/day.
> Over here, the auto-learning thresholds are still at their default
> values (though I've disabled auto-learning for now), re-training is done
> by sending the offending message back to the filter in a Message/RFC822
> attachment and there are about 90 users using the system. My Bayes
> database is in fairly good shape, but some kinds of spam have managed to
> get themselves a negative score.
I realize that my setup is smaller than most, so it's easier for me to
keep an eye on the system and see what is autolearned. Autolearning
errors need to be corrected immediately or things start to snowball in a
bad way.
So basically, I think the best practice (at least for my situation) is to
leave autolearning on, but adjust the thresholds so things don't get
learned in either direction unless absolutely sure (really low/really
high scores). Then, everything that gets an autolearn=no can be
manually trained in the correct direction. As always, Bayes can work
wonders, but it needs a little hand-holding... don't give up on it just yet ;)
-Jim
Re: The trouble with Bayes
Posted by Paul Boven <p....@chello.nl>.
Hi Jim,
Jim Maul wrote:
> Paul Boven wrote:
>> Bayes is a very powerful system, especially for recognising
>> site-specific ham. But at this moment, approx. 30% of the spam that
>> slips through my filter has 'autolearn=ham' set. And another 60% of
>> the spam slipping through has a negative Bayes score to help it along.
>> For the moment, I've disabled the autolearning in my Bayes system.
> If your system is autolearning 30% of the spam as ham it is seriously
> screwed up.
No, fortunately that's not the case. Of all the spam that slips through
(which is still just below 1%), about a third not only manages to
slip through, but even gets autolearned the wrong way.
> It only autolearns when its pretty damn sure of its
> classification of the message in question. A bad bayes database will
> only continue to get worse if left alone. The trick is starting out
> good with the learning and its cake from there. On some systems its
> even less of an issue. I've maybe manually sa-learn'ed 20-30 messages
> ever in a little over a year using SA. Everything else has been
> autolearned. Its rare that i see bayes scores other than _00 and _99.
> I'd say my bayes db is pretty damn accurate at this point, and its done
> most of it on its own. Now keep in mind that i've altered the scores of
> some rules (bayes mostly) and i've also adjusted the autolearn
> thresholds for my system. I've upped the spam and lowered the ham
> numbers so nothing will be autolearned unless SA is REALLY sure it knows
> what its doing. I'd tend to think its easier to tweak the system a bit
> than to change the way bayes/autolearning works..but hey, thats just me.
Thanks for your response. What thresholds have you set for autolearning,
and how exactly do you do your retraining? How many users does your
SpamAssassin setup have?
Over here, the auto-learning thresholds are still at their default
values (though I've disabled auto-learning for now), re-training is done
by sending the offending message back to the filter in a Message/RFC822
attachment and there are about 90 users using the system. My Bayes
database is in fairly good shape, but some kinds of spam have managed to
get themselves a negative score.
Regards, Paul Boven.
Re: The trouble with Bayes
Posted by Jim Maul <jm...@elih.org>.
Paul Boven wrote:
> Hi everyone,
>
> Here are some observations on using Bayes and autolearning I would like
> to share, and have your input on.
>
> Autolearning is turning out to be more trouble than it's worth.
> Although it helps the system get to know the ham we send and receive, and
> learn some of the spams on its own, it also tends to 'reward' the 'best'
> spammers out there. Spams that hit none of the rules (e.g. the current
> deluge of stock spams) drive the scores for all kinds of misspelled words
> towards the 'hammy' side of the curve, which makes it possible for more
> of that kind of junk to slip through even if it hits SURBLs or other rules.
>
>
<SNIP>
>
> Bayes is a very powerful system, especially for recognising
> site-specific ham. But at this moment, approx. 30% of the spam that
> slips through my filter has 'autolearn=ham' set. And another 60% of the
> spam slipping through has a negative Bayes score to help it along. For
> the moment, I've disabled the autolearning in my Bayes system.
>
> Regards, Paul Boven.
>
>
If your system is autolearning 30% of the spam as ham, it is seriously
screwed up. It only autolearns when it's pretty damn sure of its
classification of the message in question. A bad Bayes database will
only continue to get worse if left alone. The trick is starting out
good with the learning, and it's cake from there. On some systems it's
even less of an issue. I've maybe manually sa-learn'ed 20-30 messages
ever, in a little over a year using SA. Everything else has been
autolearned. It's rare that I see Bayes scores other than _00 and _99.
I'd say my Bayes db is pretty damn accurate at this point, and it's done
most of it on its own. Now keep in mind that I've altered the scores of
some rules (Bayes mostly) and I've also adjusted the autolearn
thresholds for my system. I've upped the spam and lowered the ham
numbers so nothing will be autolearned unless SA is REALLY sure it knows
what it's doing. I'd tend to think it's easier to tweak the system a bit
than to change the way Bayes/autolearning works... but hey, that's just me.
-Jim
Re: The trouble with Bayes
Posted by James R <ja...@trusswood.dyndns.org>.
Paul Boven wrote:
> Hi everyone,
>
> Here are some observations on using Bayes and autolearning I would like
> to share, and have your input on.
>
> Autolearning is turining out to be more trouble than it's worth.
> Although it helps the system to get to know the ham we send and get, and
> learn some of the spams on its own, it also tends to 'reward' the 'best'
> spammers out there. Spams that hit none of the rules (e.g. the current
> deluge of stock-spams) drive the score for all kinds of misspelled words
> towards the 'hammy' side of the curve, which makes it possible for more
> of that kind of junk to slip trough even if it hits SURBLSs or other rules.
>
> <SNIP>
>
> Regards, Paul Boven.
>
Several of the reasons you've mentioned are why I don't do
autolearning. Manual training and user feedback, imho, are the best way
to get the Bayes db up to spam-fighting levels. It may be more
troublesome for ISPs whose mail is more of a mix, but here we have a
pretty standard set of mails; by that I mean that mail to many of our
users sounds about the same. I can grab a few dozen mails out of our
archives, send them to my server's 'ham' box and let the cron job train
on those.
As far as a standard interface goes, there is no standard mail
server/OS/environment. This is generally something the admin or a third
party would need to draft up. I have scripts that were created from
third-party parts, munged, and grafted into my own. We have a standard
mail client here, and a standard way for users to submit mails to the
global spambox. My scripts remove any markup from that transmission
(these are not forwarded, but redirected) and drop the mail files into a
spam folder, where I look at the mails to make sure they really are
spam; the last step is to move them to where the Linux server picks them
up and does the training. As you can see, there are a lot of steps, but
this ensures a user doesn't accidentally train the wrong mail.
From my own testing, SA does create a hash of the mail for the msg-id.
We were thinking that if a spammer created messages with the exact same
message-id every time, that could bypass any training on future
near-identical spam messages, because the learner would ignore the mail.
There are flaws in every system, but the ones in the present one, imo,
are not so bad that they make it unusable. Bayes in SA is very good, it
does a good job, and is easy to set up and train -- but by the same
token, it's easy to hose your db with incorrect training (as you've
seen). That's why I've got so many steps in my training process. But
then again, I use RBLs etc. at the get-go of the SMTP conversation,
which blocks the vast majority of spams, so I don't have to use BW or
store any spams that made it past the initial SMTP conversation.
--
Thanks,
James
Re: The trouble with Bayes
Posted by Mike Grice <mg...@plus.net>.
On Fri, 2005-05-06 at 14:28 +0200, Paul Boven wrote:
> Hi everyone,
>
> <SNIP>
>
> The second weakness in the current Bayes setup concerns the
> 're-training' of the filter. The assumption in Bayes is that if a mail
> gets submitted for training, it will first be 'forgotten' and then
> correctly learned as spam (or ham). But in order to 'forget',
> SpamAssassin must be able to recognise that the submitted message is the
> same as a previously autolearned one. Currently this is done by checking
> the Message-ID or a checksum of the headers. There are two potential
> pitfalls here: Firstly, the message resubmitted for retraining is never
> exactly the same as the original. It has made another hop to the
> mailstore, or has been mangled by Exchange or some user agent. Secondly,
> especially if the original Message-ID was not used by the autolearner,
> the SA-generated Message-ID will not be the same as the original. As
> soon as that happens, retraining becomes far less powerful: when the
> original faulty autolearning doesn't get 'forgotten', the retraining
> will mostly cancel it out, but never gets a chance to correct the Bayes
> scores for those tokens.
DSPAM gets around this by assigning each message a DSPAM-ID, which is
kept, at your choice, in the body of the mail, attached to the mail, or
in the headers. It then keeps a record of every DSPAM-ID and looks for
it in the mail when it's sent back for training.
I have problems with this method because it clobbers any database on a
sufficiently high-volume site (as do Bayes and the AWL in general). There
must be some other way to do it, but doing multiple writes to a database
for every mail passing through the system is a real resource glutton (and
so I have to keep them disabled).
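In outline, that mechanism might look like this (the header name and the in-memory store are purely illustrative; DSPAM's actual implementation differs):

```python
import re
import uuid

seen = {}  # tracking id -> what was originally learned

def tag(message: str, verdict: str) -> str:
    """At delivery time, stamp the mail with an id and record the verdict."""
    tid = uuid.uuid4().hex
    seen[tid] = verdict
    return f"X-Tracking-ID: {tid}\n" + message

def find_id(returned: str):
    """On retraining, recover the id so the original learn can be undone."""
    m = re.search(r"^X-Tracking-ID:\s*([0-9a-f]+)", returned, re.MULTILINE)
    return m.group(1) if m and m.group(1) in seen else None

tagged = tag("Subject: hi\n\nhello\n", "ham")
# The mailstore adds headers on the way back, but the tracking header
# itself survives, so the message is still recognised.
returned = "Received: by mailstore\n" + tagged
print(seen[find_id(returned)])  # ham
```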
Users have problems with the above method because they don't like extra
stuff in their message (if the DSPAM-ID is at the bottom of every mail,
or attached), and if you put it in the headers a user cannot forward it
(because you don't get the headers in all cases).
Cheers
Mike
--
| Mike Grice Broadband Solutions for
| Systems Engineer Home & Business @
| PlusNet plc. www.plus.net
+ ----- PlusNet - The smarter way to broadband ------
Re: The trouble with Bayes
Posted by Paul Boven <p....@chello.nl>.
Hi Kevin, everyone,
Kevin Peuhkurinen wrote:
> Paul Boven wrote:
>> but my goal is to find a way of doing this that is
>> independent of the rest of the mail-system, and can then become an
>> integral part of SA.
>
> Any suggestions on how to do this? One of SA's strengths is that it is
> designed to be a module that can be plugged into a larger mail flow
> environment rather than acting as a monolithic application. I think
> that any attempt to create a manual training method that suits every
> environment is doomed to failure.
Well, the reason I bring this up is the hope that we can come up with
such a thing. I'm not convinced at this moment that it is already doomed
to failure: if I were, I would not have started this whole discussion.
I currently have a bit of Perl that strips message/rfc822 attachments
and feeds them to the learner, which works with a number of clients and
servers. It's run from the alias file, and has the advantage that
end-users don't need an account on the filter machine. The disadvantage
is that it is susceptible to all the changes that get inflicted on the
mail by those clients and servers.
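For what it's worth, a minimal Python equivalent of such an attachment-stripper, using only the stdlib email module, might look like this (the Perl original isn't shown in this thread, so this is a sketch, not a translation):

```python
import email
from email.mime.message import MIMEMessage
from email.mime.multipart import MIMEMultipart

def extract_originals(raw_report: bytes):
    """Yield the raw bytes of each message/rfc822 attachment in a report mail."""
    outer = email.message_from_bytes(raw_report)
    for part in outer.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a list of Message objects.
            for inner in part.get_payload():
                yield inner.as_bytes()

# Demo: a user redirects a spam sample to the filter as an attachment.
sample = email.message_from_string("Subject: spam sample\n\nbuy stocks now\n")
report = MIMEMultipart()
report.attach(MIMEMessage(sample))

for original in extract_originals(report.as_bytes()):
    print(original.decode())  # this is what would be piped to sa-learn --spam
```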
> You've just pointed out yourself why it is next to impossible for SA to
> associate a unique identification code to a specific email such that it
> will always be able to recognize that email in the future. SA has no
> control nor knowledge of what happens to the email after it has scanned
> it, and it can be altered in numerous ways. So again, do you have any
> better suggestions? As I said in my original email, any attempt to
> create a better means of identifying emails would need to be rigorously
> tested before it could be shown to actually be better than what SA
> already has.
Yes, you are right, this is not an easy puzzle. But it seems to me that
given the way mail clients and servers behave, looking at the content of
the mail could be more robust than the current method. And we don't need
to be able to uniquely identify every email forever, but just to make
sure that any (auto)learning can be undone within a reasonable time.
> It is a fine thing to demonstrate weaknesses in a product. It's a much
> better thing to suggest ways to improve it. Your suggestion to weigh
> manually learned emails heavier than autolearned ones is a good start.
Oh, please don't get me wrong. I'm not just complaining about what's
been bugging me for a while, I'd like to help get this fixed. Discussing
things seems like a good first step in that direction, with some of the
best spam-fighters hanging out in this forum.
Regards, Paul Boven.
Re: The trouble with Bayes
Posted by Kevin Peuhkurinen <ke...@meridiancu.ca>.
Paul Boven wrote:
> Hi Kevin, everyone,
>
> I agree that this would be difficult, but right now we're all facing
> that difficulty on our own, so to speak. A more comprehensive Wiki would
> help, but my goal is to find a way of doing this that is independent of
> the rest of the mail-system, and can then become an integral part of SA.
Any suggestions on how to do this? One of SA's strengths is that it is
designed to be a module that can be plugged into a larger mail flow
environment rather than acting as a monolithic application. I think
that any attempt to create a manual training method that suits every
environment is doomed to failure.
> The current system works well if your mailbox is on the system where you
> run SpamAssassin and you can retrain from the commandline. That's only a
> small subset of all email-users though. Once the setup gets a bit more
> complicated, involves IMAP servers, forwarding etc., you get in trouble.
You've just pointed out yourself why it is next to impossible for SA to
associate a unique identification code to a specific email such that it
will always be able to recognize that email in the future. SA has no
control nor knowledge of what happens to the email after it has scanned
it, and it can be altered in numerous ways. So again, do you have any
better suggestions? As I said in my original email, any attempt to
create a better means of identifying emails would need to be rigorously
tested before it could be shown to actually be better than what SA
already has.
It is a fine thing to demonstrate weaknesses in a product. It's a much
better thing to suggest ways to improve it. Your suggestion to weigh
manually learned emails heavier than autolearned ones is a good start.
Re: The trouble with Bayes
Posted by Paul Boven <p....@chello.nl>.
Hi Kevin, everyone,
Kevin Peuhkurinen wrote:
>> 2.) There should be a framework within SpamAssassin that makes it easy
>> for end-users to submit their spam for training. Currently, there are
>> all kinds of scripts available outside the main SpamAssassin
>> distribution (I've written my own, too) that attempt to get the
>> message out of the mail-client or server and as close as possible to
>> the original, to feed back to Bayes. Which is close to impossible with
>> some of the mail-servers out there. SpamAssassin currently only
>> includes half the Bayes interface: you can have auto-learning, but for
>> manual learning or retraining you're on your own to some extent.
>
> This I have to disagree with you on. SA is used on too many different
> types of systems in too many different environments for it to make any
> sort of sense to try to concoct a one-size-fits-all solution to
> learning. A better approach would be a one-stop source of information
> on how to implement learning in various environments, perhaps here:
> http://wiki.apache.org/spamassassin/BayesInSpamAssassin
I agree that this would be difficult, but right now we're all facing
that difficulty on our own, so to speak. A more comprehensive Wiki would
help, but my goal is to find a way of doing this that is independent of
the rest of the mail-system, and can then become an integral part of SA.
> I agree that basing the classification on message IDs is "fragile", but
> I'm not sure that any other approach would be better. Perhaps an MD5
> sum of the contents, not including headers or attachments? It would
> require a fair bit of testing of the various methods, in various
> real-world environments, before you could authoritatively say that one
> method is clearly superior to the one currently used.
The current system works well if your mailbox is on the system where you
run SpamAssassin and you can retrain from the commandline. That's only a
small subset of all email-users though. Once the setup gets a bit more
complicated, involves IMAP servers, forwarding etc., you get in trouble.
>> 4.) The Bayes subsystem should store this classification, and all the
>> tokens it learned. This way we can be sure that we correctly unlearn an
>> autolearned message. The entries in this database could be timestamped
>> so they can be removed after some months, to prevent unlimited growth.
> Sounds like a good idea. However, my Bayes database is already about
> 60MB. A significantly larger database may be a problem for some systems
> with limited storage space.
Fortunately, this would not increase the Bayes token database itself,
only the Bayes_seen database, which is accessed only during
(auto)learning, not during classification.
Regards, Paul Boven.
Re: The trouble with Bayes
Posted by Kevin Peuhkurinen <ke...@meridiancu.ca>.
Paul Boven wrote:
> Hi everyone,
>
> Here are some observations on using Bayes and autolearning I would like
> to share, and have your input on.
Okay!
>
> Some suggestions on improving the performance of the Bayes system:
>
> 1.) Messages that have been manually submitted should have a higher
> 'weight' in the Bayes statistics than autolearned messages.
I agree with you there. It seems to make good sense.
>
> 2.) There should be a framework within SpamAssassin that makes it easy
> for end-users to submit their spam for training. Currently, there are
> all kinds of scripts available outside the main SpamAssassin
> distribution (I've written my own, too) that attempt to get the message
> out of the mail-client or server and as close as possible to the
> original, to feed back to Bayes. Which is close to impossible with some
> of the mail-servers out there. SpamAssassin currently only includes half
> the Bayes interface: you can have auto-learning, but for manual learning
> or retraining you're on your own to some extent.
This I have to disagree with you on. SA is used on too many different
types of systems in too many different environments for it to make any
sort of sense to try to concoct a one-size-fits-all solution to
learning. A better approach would be a one-stop source of information
on how to implement learning in various environments, perhaps here:
http://wiki.apache.org/spamassassin/BayesInSpamAssassin
>
> 3.) Message classification should not be based on something as fragile
> as a mail header or a checksum thereof, but on the actual content. The
> goal of this classifier should be to identify a message as having been
> learned before, despite whatever has happened to it after it went
> through SpamAssassin.
I agree that basing the classification on message IDs is "fragile", but
I'm not sure that any other approach would be better. Perhaps an MD5
sum of the contents, not including headers or attachments? It would
require a fair bit of testing of the various methods, in various
real-world environments, before you could authoritatively say that one
method is clearly superior to the one currently used.
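Such a body-only digest might look like this in outline (the normalisation rules here are guesses at what would survive real-world mangling, not a tested design):

```python
import email
import hashlib

def body_digest(raw: bytes) -> str:
    """MD5 over the first text/plain part only; headers and attachments ignored."""
    msg = email.message_from_bytes(raw)
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            text = (part.get_payload(decode=True) or b"").decode(errors="replace")
            # Light normalisation so line-ending and trailing-whitespace
            # mangling in transit doesn't change the digest.
            lines = [line.strip() for line in text.splitlines()]
            return hashlib.md5("\n".join(lines).encode()).hexdigest()
    return ""

m1 = b"Message-ID: <1@a>\nContent-Type: text/plain\n\nBuy stocks\n"
m2 = b"Received: by relay\nMessage-ID: <2@b>\nContent-Type: text/plain\n\nBuy stocks \r\n"
print(body_digest(m1) == body_digest(m2))  # True: same content, different headers
```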
>
> 4.) The Bayes subsystem should store this classification, and all the
> tokens it learned. This way we can be sure that we correctly unlearn an
> autolearned message. The entries in this database could be timestamped
> so they can be removed after some months, to prevent unlimited growth.
>
Sounds like a good idea. However, my Bayes database is already about
60MB. A significantly larger database may be a problem for some systems
with limited storage space.
> Bayes is a very powerful system, especially for recognising
> site-specific ham. But at this moment, approx. 30% of the spam that
> slips through my filter has 'autolearn=ham' set. And another 60% of the
> spam slipping through has a negative Bayes score to help it along. For
> the moment, I've disabled the autolearning in my Bayes system.
I'm not sure that my experiences are similar. I don't think that many
of my false negatives are doing better than BAYES_50, but I'll take a
closer look.