Posted to users@spamassassin.apache.org by Joe Emenaker <jo...@emenaker.com> on 2004/11/09 02:38:45 UTC
A script to auto-adjust spam-threshold. Suggestions?
For some time now, I've had my email system set up with two "trash"
folders: one for ham-trash and one for spam-trash. Hourly, my system
goes through them and uses them to update the Bayes database. However,
the script that does this *also* records, for each message, the overall
spam score the message received and whether it was found in the spam
trash or the ham trash.
Now that I have a lot of data, I have written a script that tallies it
all up and, rather than making me pick a spam-threshold score, lets me
merely indicate the false-positive or false-negative rate that I
prefer... and then the script figures out what score I need.
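
The tallying described above might look roughly like the following. This is only a sketch under stated assumptions: the recorded data is available as (score, is_spam) pairs, a message counts as spam when its score is at or above the threshold, and the list of records is non-empty. The function names and sample records are hypothetical and are not taken from the actual script.

```python
# Hypothetical sketch: names and sample data are illustrative only.

def rates(records, threshold):
    """Return (fp_rate, fn_rate) for a list of (score, is_spam) records.

    Assumes a message is flagged as spam when score >= threshold.
    """
    ham = [s for s, is_spam in records if not is_spam]
    spam = [s for s, is_spam in records if is_spam]
    # False positive: ham scoring at or above the threshold (ham lost).
    fp = sum(s >= threshold for s in ham) / len(ham) if ham else 0.0
    # False negative: spam scoring below the threshold (spam allowed).
    fn = sum(s < threshold for s in spam) / len(spam) if spam else 0.0
    return fp, fn


def candidates(records, margin=0.1):
    """Every observed score, plus one value just above the maximum."""
    scores = sorted({s for s, _ in records})
    return scores + [scores[-1] + margin]


def threshold_for_fp(records, one_in):
    """Lowest candidate threshold losing at most one ham in `one_in`."""
    for t in candidates(records):
        if rates(records, t)[0] <= 1.0 / one_in:
            return t
    return None


def threshold_for_fn(records, one_in):
    """Highest candidate threshold allowing at most one spam in `one_in`."""
    for t in reversed(candidates(records)):
        if rates(records, t)[1] <= 1.0 / one_in:
            return t
    return None


# Made-up records: (score, found_in_spam_trash)
records = [
    (0.5, False), (1.0, False), (2.0, False), (6.0, False),
    (3.0, True), (7.0, True), (9.0, True), (12.0, True),
]
print(threshold_for_fp(records, 4), threshold_for_fn(records, 4))
```

Scanning only the observed scores keeps the search exact for the data on hand; a real tool might prefer fixed 0.1 steps so the suggested thresholds look tidier.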
The idea is that I would indicate my false-positive or false-negative
preference, and then the script could run once a week, for example, and
adjust my spam-threshold in my SA user preferences.
Since I'm considering putting this into a complete "auto-tuning" kit for
SA, I'm interested in hearing some suggestions.
Right now, my idea is that it would be used through some
user-configuration webpage, so the user would need to be presented with
some scenarios. For that purpose, the script can show scenarios for a
few false-positive and false-negative rates, as in the sample output
below. The first three aim for false-positive rates of 1-in-10,
1-in-100, and 1-in-1000, while the next three aim for the same
false-negative rates:
Spam-Threshold: 0.3
  Ham messages lost: 1 in every 10.02
  Spam messages allowed: 1 in every 241.92
Spam-Threshold: 8.2
  Ham messages lost: 1 in every 118.20
  Spam messages allowed: 1 in every 29.58
Spam-Threshold: 15
  Ham messages lost: 1 in every 99999.00
  Spam messages allowed: 1 in every 2.44
Spam-Threshold: 10
  Ham messages lost: 1 in every 147.75
  Spam messages allowed: 1 in every 10.32
Spam-Threshold: 5.7
  Ham messages lost: 1 in every 45.46
  Spam messages allowed: 1 in every 87.74
Spam-Threshold: -5.8
  Ham messages lost: 1 in every 1.04
  Spam messages allowed: 1 in every 266.18
So that this data would be easy for a CGI script to use in a web page,
the script also outputs it in comma-separated form, one scenario per
line, with the rates truncated to whole numbers. The format is:
"score,<one-FP-in-every-X-messages>,<one-FN-in-every-X-messages>"
0.3,10,241
8.2,118,29
15.0,99999,2
10.0,147,10
5.7,45,87
-5.8,1,266
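
As a rough illustration of how a CGI script might consume these lines, here is a small sketch using Python's standard csv module. The function name and dictionary keys are hypothetical; only the three-column layout comes from the output shown above.

```python
import csv
import io


def parse_scenarios(text):
    """Parse 'score,fp_one_in,fn_one_in' lines into a list of dicts."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        score, fp_one_in, fn_one_in = fields
        rows.append({
            "score": float(score),        # candidate spam-threshold
            "fp_one_in": int(fp_one_in),  # one ham lost in every N
            "fn_one_in": int(fn_one_in),  # one spam allowed in every N
        })
    return rows


# The first three lines of the sample output above.
sample = "0.3,10,241\n8.2,118,29\n15.0,99999,2\n"
for row in parse_scenarios(sample):
    print(row)
```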
Now, to get a spam-threshold for, say, one FP in every 500 messages, you
might pass it a command-line argument of "FP:500" and it would just spit
you back a single number. Same would go for a false-negative... passing
something like "FN:500".
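
One way such an argument could be parsed is sketched below. The function name and the choice to accept lower-case prefixes are assumptions for illustration, not a description of the actual script.

```python
def parse_target(arg):
    """Split an argument like 'FP:500' into ('FP', 500.0).

    Raises ValueError for anything that is not FP:<n> or FN:<n>
    with a positive number n.
    """
    kind, sep, value = arg.partition(":")
    kind = kind.upper()  # assumption: accept 'fp:500' as well
    if sep != ":" or kind not in ("FP", "FN"):
        raise ValueError("expected FP:<n> or FN:<n>, got %r" % arg)
    one_in = float(value)  # raises ValueError if not a number
    if one_in <= 0:
        raise ValueError("rate must be positive, got %r" % value)
    return kind, one_in


print(parse_target("FP:500"))
print(parse_target("FN:500"))
```

The returned pair says which error rate to target and the "one in every N" figure; the script would then look up the matching threshold and print that single number.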
Does anybody else out there envision other ways to use this script? Are
there any other features it should have?
- Joe