Posted to commits@spamassassin.apache.org by Apache Wiki <wi...@apache.org> on 2007/07/03 16:40:09 UTC
[Spamassassin Wiki] Update of "SpamAssassinChallenge" by JustinMason
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.
The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/SpamAssassinChallenge
The comment on the change is:
first draft
New page:
= The SpamAssassin Challenge =
(THIS IS A DRAFT; see [http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376 bug 5376 for discussion])
The [http://www.netflixprize.com/ Netflix Prize] is a machine-learning
challenge from Netflix which 'seeks to substantially improve the accuracy of
predictions about how much someone is going to love a movie based on their
movie preferences.'
We in SpamAssassin have similar problems; maybe we can solve them in a similar way.
We have:
* a publishable large set of test data
* some basic rules as to how the test data is interpreted
* a small set of output values as a result
* which we can quickly measure to estimate how "good" the output is.
Unfortunately we won't have a prize. Being able to say "our code is used to
generate SpamAssassin's scores" makes for good bragging rights, though,
I hope ;)
== Input: the test data: mass-check logs ==
We will take the SpamAssassin 3.2.0 mass-check logs, and split them into
test and training sets; 90% for training, 10% for testing, is traditional.
Any cleanups that we had to do during [http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5270 bug 5270] are re-applied.
The test set is saved, and not published.
The training set is published.
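As an illustrative sketch (Python here purely for illustration; the challenge itself is language-agnostic), the 90/10 split could look like the following. The shuffling and fixed seed are assumptions for the example, not part of any official procedure:

```python
import random

def split_logs(lines, train_frac=0.9, seed=42):
    """Shuffle mass-check log lines and split them into training and
    test sets. train_frac=0.9 gives the traditional 90/10 split; the
    seed is an arbitrary choice to make the split reproducible."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```

In the real challenge the test 10% is held back unpublished, so contestants would only ever see the training portion.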
== Input: the test data: rules, starting scores, and mutability ==
We can provide "tmp/rules_*.pl" (generated by "build/parse-rules-for-masses").
These are perl data dumps from Data::Dumper, listing every SpamAssassin rule,
its starting score, a flag indicating whether the rule is mutable, and other
metadata about it.
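For illustration, here is a hypothetical Python mirror of the kind of per-rule metadata those dumps carry. The field names and the scores shown are invented for the example; they are not the exact keys that Data::Dumper emits in "tmp/rules_*.pl":

```python
# Hypothetical mirror of two entries from tmp/rules_*.pl.
# Field names ("score", "mutable", "tflags") are illustrative only.
RULES = {
    "FROM_STARTS_WITH_NUMS": {
        "score": 1.0,      # starting score (example value)
        "mutable": True,   # whether a rescorer may change it
        "tflags": [],      # e.g. ["net"], ["learn"], ["nice"], ["userconf"]
    },
    "BAYES_99": {
        "score": 3.5,      # example value
        "mutable": False,  # Bayes scores are locked (see Mutability below)
        "tflags": ["learn"],
    },
}

def mutable_rules(rules):
    """Return, sorted, the names of rules whose scores may be changed."""
    return sorted(name for name, meta in rules.items() if meta["mutable"])
```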
== Mutability ==
Mutability of rule scores is a key factor. Some of the rules in the
SpamAssassin ruleset have immutable scores, typically because:
* they frequently appear in both ham and spam (therefore should not be given significant scores)
* or we have chosen to "lock" their scores to specific values, to reduce user confusion (like the Bayes rules)
* or to ensure that if the rule fires, it will always have a significant value, even though it has never fired yet (the "we dare you" rules)
* or we reckon that the rule's behaviour is too dependent on user configuration for the score to be reliably estimated ("tflags userconf" rules)
This mutability is defined by us up-front, by selecting where the rule's score
appears in "rules/50_scores.cf" (there are "mutable sections" and "immutable
sections" of the file).
In addition to this, some rules are forced to be immutable by the code
(in "masses/score-ranges-from-freqs"):
* rules that require user configuration to work ("tflags userconf")
* network rules (with "tflags net") in score sets 0 and 2
 * trained rules (with "tflags learn") in score sets 0 and 1
Some scores are always forced to be 0 (in "masses/score-ranges-from-freqs").
These are:
* network rules (with "tflags net") in score sets 0 and 2
 * trained rules (with "tflags learn") in score sets 0 and 1
(Rules with scores of 0 are effectively ignored for that score set, and
are not run at all in the scanner, so this is an optimization. If you
don't know what a score set is, see MassesOverview.)
In addition, rules that fired on fewer than 0.01% of messages overall are
forced to 0. This is because we cannot reliably estimate what score they
*should* have, due to a lack of data; and also because we judge that they
won't make a significant difference to results either way. (Typically, if we
needed to ensure such a rule was active, we'd make it immutable and assign a
score ourselves.)
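Pulling the forced-zero rules above together, here is a sketch in Python (function name and signature are hypothetical, not from "masses/score-ranges-from-freqs"). It assumes the usual score-set layout: network tests are disabled in sets 0 and 2, and Bayes is disabled in sets 0 and 1:

```python
def effective_score(score, tflags, scoreset, hit_rate):
    """Apply the forced-zero rules to a candidate score.

    tflags:   list of tflags for the rule, e.g. ["net"] or ["learn"]
    scoreset: 0-3 (see MassesOverview)
    hit_rate: fraction of all messages the rule fired on (0.0001 = 0.01%)
    """
    if "net" in tflags and scoreset in (0, 2):
        return 0.0   # network tests don't run in these score sets
    if "learn" in tflags and scoreset in (0, 1):
        return 0.0   # Bayes is disabled in these score sets
    if hit_rate < 0.0001:
        return 0.0   # too rare to estimate a reliable score
    return score
```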
During SpamAssassin 3.2.0 rescoring, we had 590 mutable rules, and 70
immutable ones.
'''TODO:''' we will need to fix the tmp/rules*.pl file to reflect the limitations
imposed in this section, or generate another file that includes these changes.
== Score Ranges ==
Currently, we don't allow a rescoring algorithm to assign arbitrary
scores to mutable rules. Instead we have some guidelines:
'''Polarity''': scores for "tflags nice" rules (rules that detect nonspam)
should be below 0, and scores for rules that hit spam should be above 0.
This is important, since if a spam-targeting rule winds up getting a negative
score, spammers will quickly learn to exploit this to give themselves
negative points and get their mails marked as nonspam.
'''No single hit''': scores shouldn't be above 5.0 points; we don't like to
have rules that immediately mark a mail as spam.
'''Magnitude''': we try to keep the maximum score for a rule proportional to
the Bayesian P(spam) probability of the rule. In other words, a rule that hits
all spam and no ham gets a high score, and a rule that fires on ham 10% of the
time (P(spam) = 0.90) would get a lower score. Similarly, a "tflags nice" rule that hits all ham and no spam would get a
large negative score, whereas a "tflags nice" rule that hit spam 10% of
the time would get a less negative score. (Note that this is not
necessarily the score the rule will get; it's just the maximum ''possible'' score
that the algorithm is allowed to assign for the rule.)
These are the current limitations of our rescoring algorithm; they're
not hard and fast rules, since there are almost definitely better
ways to do them. (It's hard to argue with "Polarity" in particular,
though.)
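The three guidelines can be sketched as a single bounds function. The linear scaling by P(spam) below is an assumption for illustration; it is not the exact formula used by "masses/score-ranges-from-freqs":

```python
MAX_SCORE = 5.0  # the "no single hit" ceiling

def score_range(spam_hits, ham_hits, nice=False):
    """Return (lo, hi) bounds on a rule's score.

    spam_hits / ham_hits: counts of spam and ham the rule fired on.
    nice: True for "tflags nice" (ham-detecting) rules.
    """
    total = spam_hits + ham_hits
    if total == 0:
        return (0.0, 0.0)  # never fired: score forced to 0
    p_spam = spam_hits / total
    if nice:
        # Polarity: nice rules get negative scores only;
        # Magnitude: larger (more negative) the more ham-pure the rule is.
        return (-MAX_SCORE * (1.0 - p_spam), 0.0)
    # Polarity: spam rules get positive scores only;
    # Magnitude: capped in proportion to P(spam).
    return (0.0, MAX_SCORE * p_spam)
```

So a rule that hits only spam gets a range of (0.0, 5.0), one with P(spam) = 0.90 gets (0.0, 4.5), and a nice rule hitting only ham gets (-5.0, 0.0); the algorithm may then pick any score inside the range.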
== Output: the scores ==
The output should be a file in the following format:
{{{
score RULE_NAME 1.495
score RULE_2_NAME 3.101
...
}}}
Each line lists a rule name and its score; there is one line per mutable rule.
We can then use the "masses/rewrite-cf-with-new-scores" script to insert
those scores into our own scores files, and test FP% / FN% rates with
our own test set of logs.