Posted to users@spamassassin.apache.org by Joe Emenaker <jo...@emenaker.com> on 2004/11/09 02:38:45 UTC

A script to auto-adjust spam-threshold. Suggestions?

For some time now, I've had my email system set up with two 
"trash" folders: one for ham-trash and one for spam-trash. Hourly, my 
system goes through them and uses them to update the Bayes database. 
However, the script that does this *also* records the overall spam score 
each message received, as well as whether it was found in the spam trash 
or the ham trash.
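
A minimal sketch of that recording step, assuming the score is read back 
from the X-Spam-Status header that SpamAssassin adds ("score=" in 3.x, 
"hits=" in older versions). The function name and the "ham"/"spam" labels 
are hypothetical; the real script may well get its score some other way.

```python
import email
import re

# Matches "score=12.3" (SA 3.x) or "hits=12.3" (older versions).
_SCORE_RE = re.compile(r"\b(?:score|hits)=(-?\d+(?:\.\d+)?)")


def score_record(raw_message, folder_label):
    """Return (score, folder_label) for one message, or None if unscored.

    folder_label is "ham" or "spam", depending on which trash folder
    the message was found in (hypothetical labels).
    """
    msg = email.message_from_string(raw_message)
    status = str(msg.get("X-Spam-Status", ""))
    match = _SCORE_RE.search(status)
    if match is None:
        return None
    return (float(match.group(1)), folder_label)
```

Each returned pair could then be appended to whatever log the tallying 
script reads later.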

Now that I have a lot of data, I have written a script that tallies it 
all up and, rather than making me pick a spam-threshold score, lets me 
merely indicate the false-positive or false-negative rate that I prefer... 
and then the script figures out what score I need.
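
One way the score lookup could work is sketched below. It is not the 
actual script: it assumes a message counts as spam when its score is at 
or above the threshold, that thresholds are tried in 0.1 steps, and that 
"1 in N" means "at most 1 in N". The function names and the -10 to 30 
search range are made up for illustration.

```python
def threshold_for_fp(ham_scores, one_in, lo=-10.0, hi=30.0):
    """Lowest threshold (0.1 steps) that loses at most 1 ham in `one_in`."""
    total = len(ham_scores)
    if total == 0:
        return None
    # Raising the threshold can only reduce lost ham, so scan upward
    # and stop at the first threshold that meets the target.
    for tenths in range(round(lo * 10), round(hi * 10) + 1):
        t = tenths / 10
        lost = sum(1 for s in ham_scores if s >= t)
        if lost * one_in <= total:
            return t
    return None


def threshold_for_fn(spam_scores, one_in, lo=-10.0, hi=30.0):
    """Highest threshold (0.1 steps) that allows at most 1 spam in `one_in`."""
    total = len(spam_scores)
    if total == 0:
        return None
    # Lowering the threshold can only reduce missed spam, so scan downward.
    for tenths in range(round(hi * 10), round(lo * 10) - 1, -1):
        t = tenths / 10
        allowed = sum(1 for s in spam_scores if s < t)
        if allowed * one_in <= total:
            return t
    return None
```

Picking the lowest threshold that satisfies a false-positive target (and 
the highest that satisfies a false-negative target) keeps the opposite 
error rate as small as the target permits.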

The idea is that I would indicate my false-positive or false-negative 
preference, and then the script could run once a week, for example, and 
adjust my spam-threshold in my SA user preferences.
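
The weekly adjustment might look something like this sketch, which 
rewrites the threshold line in the text of a user preferences file. It 
assumes the setting is named required_score (required_hits in older SA 
versions); the function name is hypothetical, and reading and writing 
the file itself is left out.

```python
import re

# Matches an existing "required_score N" or "required_hits N" line.
_REQUIRED_RE = re.compile(
    r"^[ \t]*required_(?:score|hits)[ \t]+\S+.*$", re.MULTILINE
)


def set_required_score(prefs_text, threshold):
    """Return prefs_text with the spam threshold set to `threshold`."""
    new_line = f"required_score {threshold:.1f}"
    if _REQUIRED_RE.search(prefs_text):
        # Replace only the first matching line; leave everything else alone.
        return _REQUIRED_RE.sub(new_line, prefs_text, count=1)
    # No existing setting: append one, keeping the file newline-terminated.
    if prefs_text and not prefs_text.endswith("\n"):
        prefs_text += "\n"
    return prefs_text + new_line + "\n"
```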

Since I'm considering putting this into a complete "auto-tuning" kit for 
SA, I'm interested in hearing some suggestions.

Right now, my idea is that it would be used through some 
user-configuration webpage. As such, the user would need to be presented 
with some scenarios. For that purpose, the script can show scenarios 
for a few false-positive and false-negative rates, as in the sample 
output below. The first three aim for false-positive rates of 1-in-10, 
1-in-100, and 1-in-1000, while the next three aim for the same rates for 
false-negatives:

   Spam-Threshold: 0.3
   Ham messages lost: 1 in every 10.02
   Spam messages allowed: 1 in every 241.92

   Spam-Threshold: 8.2
   Ham messages lost: 1 in every 118.20
   Spam messages allowed: 1 in every 29.58

   Spam-Threshold: 15
   Ham messages lost: 1 in every 99999.00
   Spam messages allowed: 1 in every 2.44

   Spam-Threshold: 10
   Ham messages lost: 1 in every 147.75
   Spam messages allowed: 1 in every 10.32

   Spam-Threshold: 5.7
   Ham messages lost: 1 in every 45.46
   Spam messages allowed: 1 in every 87.74

   Spam-Threshold: -5.8
   Ham messages lost: 1 in every 1.04
   Spam messages allowed: 1 in every 266.18
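
A scenario in the layout shown above could be produced along these 
lines. This is only a sketch: the function names are invented, a message 
is assumed to be flagged when its score is at or above the threshold, and 
treating 99999 as a stand-in for "no such errors observed" is a guess 
based on the sample.

```python
NEVER_SEEN = 99999.0  # assumed placeholder when no errors were observed


def one_in(total, errors):
    """Express an error count as 'one in every X messages'."""
    return total / errors if errors else NEVER_SEEN


def format_scenario(threshold, ham_scores, spam_scores):
    """Build the three-line scenario text for one threshold."""
    ham_lost = sum(1 for s in ham_scores if s >= threshold)
    spam_allowed = sum(1 for s in spam_scores if s < threshold)
    fp = one_in(len(ham_scores), ham_lost)
    fn = one_in(len(spam_scores), spam_allowed)
    return "\n".join([
        f"Spam-Threshold: {threshold:g}",
        f"Ham messages lost: 1 in every {fp:.2f}",
        f"Spam messages allowed: 1 in every {fn:.2f}",
    ])
```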


Now, so that this data would be easy for a cgi script to use in a web 
page, the script also outputs it in comma-separated form, one line per scenario:
   "score,<one-FP-in-every-X-messages>,<one-FN-in-every-X-messages>"
   0.3,10,241
   8.2,118,29
   15.0,99999,2
   10.0,147,10
   5.7,45,87
   -5.8,1,266
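
For illustration, here is a sketch of writing and reading lines in that 
layout. The function names are hypothetical; the one-decimal score and 
the truncation of the "one in every X" figures to whole numbers are 
inferred from the sample lines above.

```python
def csv_line(score, fp_one_in, fn_one_in):
    """Format one scenario as 'score,FP,FN', truncating the rates."""
    return f"{score:.1f},{int(fp_one_in)},{int(fn_one_in)}"


def parse_csv_line(line):
    """Split a 'score,FP,FN' line back into (float, int, int)."""
    score, fp_one_in, fn_one_in = line.strip().split(",")
    return float(score), int(fp_one_in), int(fn_one_in)
```

A cgi script on the web side would only need something like 
parse_csv_line to turn each line into values for the page.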

Now, to get a spam-threshold for, say, one FP in every 500 messages, you 
might pass it a command-line argument of "FP:500" and it would just spit 
back a single number. The same would go for a false-negative target, by 
passing something like "FN:500".
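
Parsing such an argument could be as simple as the sketch below. The 
function name is hypothetical, and accepting lower-case "fp"/"fn" is an 
added convenience rather than something the script is known to do.

```python
def parse_target(arg):
    """Parse 'FP:500' or 'FN:500' into a (kind, one_in) pair."""
    kind, sep, value = arg.partition(":")
    kind = kind.upper()
    if sep != ":" or kind not in ("FP", "FN"):
        raise ValueError(f"expected FP:<n> or FN:<n>, got {arg!r}")
    one_in = int(value)  # raises ValueError if the count is not a whole number
    if one_in <= 0:
        raise ValueError("the message count must be positive")
    return kind, one_in
```

The returned kind would then select whether the false-positive or the 
false-negative tally is searched for a matching threshold.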

Does anybody else out there envision other ways to use this script? Are 
there any other features it should have?

- Joe