Posted to commits@spamassassin.apache.org by Apache Wiki <wi...@apache.org> on 2007/07/03 16:40:09 UTC

[Spamassassin Wiki] Update of "SpamAssassinChallenge" by JustinMason

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.

The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/SpamAssassinChallenge

The comment on the change is:
first draft

New page:
= The SpamAssassin Challenge =

(THIS IS A DRAFT; see [http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376 bug 5376 for discussion])

The [http://www.netflixprize.com/ Netflix Prize] is a machine-learning
challenge from Netflix which 'seeks to substantially improve the accuracy of
predictions about how much someone is going to love a movie based on their
movie preferences.'

We in SpamAssassin have similar problems; maybe we can solve them in a similar way.
We have:

 * a large set of test data that we can publish
 * some basic rules for how that test data is interpreted
 * a small set of output values as a result
 * a quick way to measure how "good" that output is.

Unfortunately we won't have a prize.  Being able to say "our code is used to
generate SpamAssassin's scores" makes for good bragging rights, though,
I hope ;)

== Input: the test data: mass-check logs ==

We will take the SpamAssassin 3.2.0 mass-check logs and split them into
training and test sets; 90% for training and 10% for testing is traditional.
Any cleanups that we had to do during [http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5270 bug 5270] will be re-applied.

The test set is saved but not published.

The training set is published.
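As a rough illustration of the split (not the actual tooling; the file names
and the fixed random seed below are assumptions), something like this would do
a line-by-line 90/10 split of a mass-check log:

{{{
#!/usr/bin/perl -w
# Illustrative 90/10 split of a mass-check log into training and test sets.
# The file names and the fixed random seed are assumptions, not part of the
# official rescoring tooling.
use strict;

srand(42);    # fixed seed so the split is reproducible

open my $in,    '<', 'ham-and-spam.log' or die "cannot read log: $!";
open my $train, '>', 'train.log'        or die "cannot write train.log: $!";
open my $test,  '>', 'test.log'         or die "cannot write test.log: $!";

while (my $line = <$in>) {
  next if $line =~ /^#/;    # skip comment/header lines, if any
  print { rand() < 0.9 ? $train : $test } $line;
}

close $_ for ($in, $train, $test);
}}}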

== Input: the test data: rules, starting scores, and mutability ==

We can provide "tmp/rules_*.pl" (generated by "build/parse-rules-for-masses").
These are Perl data dumps from Data::Dumper, listing every SpamAssassin rule,
its starting score, a flag indicating whether the rule is mutable or not, and
other metadata about it.
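For example, one of those dumps can be pulled into a script with do(); note
that the file name, hash name and per-rule keys below ({score}, {tflags},
{mutable}) are illustrative assumptions, so check the actual output of
"build/parse-rules-for-masses" for the real layout:

{{{
#!/usr/bin/perl -w
# Load a Data::Dumper-style rules dump and list the mutable rules.
# The file name, hash name and per-rule keys are illustrative assumptions.
use strict;

our %rules;
do 'tmp/rules.pl' or die "cannot load tmp/rules.pl: $@$!";

foreach my $name (sort keys %rules) {
  my $r = $rules{$name};
  next unless $r->{mutable};          # assumed mutability flag
  printf "%-30s start=%6.3f tflags=%s\n",
         $name, $r->{score}, ($r->{tflags} || '');
}
}}}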

== Mutability ==

Mutability of rule scores is a key factor.  Some of the rules in the
SpamAssassin ruleset have immutable scores, typically because:

 * they frequently appear in both ham and spam (therefore should not be given significant scores)
 * or we have chosen to "lock" their scores to specific values, to reduce user confusion (like the Bayes rules)
 * or to ensure that if the rule ever fires, it carries a significant score even though it has never fired yet (the "we dare you" rules)
 * or we reckon that the rule's behaviour is too dependent on user configuration for the score to be reliably estimated ("tflags userconf" rules)

We define this mutability up-front, by selecting where the rule's score
appears in "rules/50_scores.cf" (the file has "mutable" and "immutable"
sections).

In addition to this, some rules are forced to be immutable by the code
(in "masses/score-ranges-from-freqs"):

 * rules that require user configuration to work ("tflags userconf")
 * network rules (with "tflags net") in score sets 0 and 1
 * trained rules (with "tflags learn") in score sets 0 and 2

Some scores are always forced to be 0 (in "masses/score-ranges-from-freqs").
These are:

 * network rules (with "tflags net") in score sets 0 and 1
 * trained rules (with "tflags learn") in score sets 0 and 2

(Rules with scores of 0 are effectively ignored for that score set, and
are not run at all in the scanner, so this is an optimization.  If you
don't know what a score set is, see MassesOverview.)

In addition, rules that fired on fewer than 0.01% of messages overall are
forced to 0.  This is because we cannot reliably estimate what score they
*should* have, due to a lack of data; and also because we judge that they
won't make a significant difference to the results either way.  (Typically,
if we've needed to ensure such a rule was active, we'd make it immutable and
assign a score ourselves.)
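A very rough sketch of the zero-forcing described above (this is not the code
from "masses/score-ranges-from-freqs"; the %rules layout, the {hitrate} field
and the score-set bit mapping are assumptions here) might look like this:

{{{
#!/usr/bin/perl -w
# Rough sketch of forcing scores to 0, as described above.  Assumes the
# %rules hash from the earlier example, plus an illustrative per-rule hit
# fraction in $r->{hitrate}.  Score-set bits assumed: 1 = Bayes, 2 = net.
use strict;
our %rules;

my $scoreset = 3;    # 0-3; see MassesOverview

foreach my $name (keys %rules) {
  my $r      = $rules{$name};
  my $tflags = $r->{tflags} || '';

  my $force_zero = 0;
  # net rules are useless in score sets without network tests
  $force_zero = 1 if $tflags =~ /\bnet\b/   && !($scoreset & 2);
  # trained (Bayes) rules are useless in score sets without Bayes
  $force_zero = 1 if $tflags =~ /\blearn\b/ && !($scoreset & 1);
  # rules that hit fewer than 0.01% of messages overall
  $force_zero = 1 if ($r->{hitrate} || 0) < 0.0001;

  if ($force_zero) {
    $r->{score}   = 0;
    $r->{mutable} = 0;
  }
}
}}}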

During SpamAssassin 3.2.0 rescoring, we had 590 mutable rules, and 70
immutable ones.

'''TODO:''' we will need to fix the "tmp/rules_*.pl" files to reflect the
limitations imposed in this section, or generate another file that includes
these changes.

== Score Ranges ==

Currently, we don't allow a rescoring algorithm to simply generate any
score for a mutable rule at all.  Instead we have some guidelines:

'''Polarity''': scores for "tflags nice" rules (rules that detect nonspam)
should be below 0, and scores for rules that hit spam should be above 0.
This is important, since if a spam-targeting rule winds up getting a negative
score, spammers will quickly learn to exploit this to give themselves
negative points and get their mails marked as nonspam.

'''No single hit''': scores shouldn't be above 5.0 points; we don't like to
have rules that immediately mark a mail as spam.

'''Magnitude''': we try to keep the maximum score for a rule proportional to
the Bayesian P(spam) probability of the rule.  In other words, a rule that hits
all spam and no ham gets a high score, and a rule that fires on ham 10% of the
time (P(spam) = 0.90) would get a lower score. Similarly, a "tflags nice" rule that hits all ham and no spam would get a
large negative score, whereas a "tflags nice" rule that hit spam 10% of
the time would get a less negative score.  (Note that this is not
necessarily the score the rule will get; it's just the maximum ''possible'' score
that the algorithm is allowed to assign for the rule.)

These are the current limitations of our rescoring algorithm; they're not
hard-and-fast rules, since there are almost certainly better ways to do
them.  (It's hard to argue with "Polarity" in particular, though.)
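As a rough illustration of how those guidelines could be turned into per-rule
limits (again, this is not the actual logic in
"masses/score-ranges-from-freqs"; the linear scaling and the {pspam} field
name are assumptions):

{{{
#!/usr/bin/perl -w
# Illustrative per-rule score limits following the Polarity, "No single hit"
# and Magnitude guidelines above.  The 5.0 cap comes from the text; the
# P(spam) field name and the linear scaling are assumptions.
use strict;

sub score_range {
  my ($rule) = @_;      # hashref with {tflags} and {pspam}
  my $cap = 5.0;
  if (($rule->{tflags} || '') =~ /\bnice\b/) {
    # ham-targeting rule: from a negative limit scaled by P(ham) up to 0
    return (-$cap * (1.0 - $rule->{pspam}), 0);
  }
  # spam-targeting rule: from 0 up to a positive limit scaled by P(spam)
  return (0, $cap * $rule->{pspam});
}

# e.g. a spam rule hitting ham 10% of the time (P(spam) = 0.90)
my ($min, $max) = score_range({ tflags => '', pspam => 0.90 });
printf "allowed score range: %.2f .. %.2f\n", $min, $max;   # 0.00 .. 4.50
}}}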

== Output: the scores ==

The output should be a file in the following format:

{{{
score RULE_NAME    1.495
score RULE_2_NAME  3.101
...
}}}

The file lists the rule name and score, one per line, for each mutable rule.
We can then use the "masses/rewrite-cf-with-new-scores" script to insert
those scores into our own scores files, and test FP% / FN% rates with
our own test set of logs.
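A minimal sketch of emitting that file from a hash of newly computed scores
(the hash contents and the output file name are assumptions); the result can
then be fed to "masses/rewrite-cf-with-new-scores":

{{{
#!/usr/bin/perl -w
# Write newly computed scores for the mutable rules in the expected
# "score RULE_NAME <value>" format.  %new_scores is an assumed input.
use strict;

my %new_scores = (RULE_NAME => 1.495, RULE_2_NAME => 3.101);

open my $out, '>', 'new-scores.cf' or die "cannot write new-scores.cf: $!";
foreach my $rule (sort keys %new_scores) {
  printf {$out} "score %-20s %.3f\n", $rule, $new_scores{$rule};
}
close $out;
}}}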