Posted to dev@spamassassin.apache.org by Robert Menschel <Ro...@Menschel.net> on 2005/07/25 07:28:03 UTC

Streamlining Rules Process

Saturday, July 23, 2005, 8:36:58 PM, Duncan wrote:

DF>  * We discussed at length the ideas for the new rules project, and we
DF> came up with some ideas, which we're trying to track
DF> http://wiki.apache.org/spamassassin/RulesProjectPlan (Please give us
DF> your feedback)

http://wiki.apache.org/spamassassin/RulesProjStreamlining

One item not yet mentioned on this page is how to score rules, whether
they go to core with rapid distribution (such as via sa-update) or to
the extra rule sets.

The ideal would be to find some way to incorporate new rules into a
GA/Perceptron-like mechanism, perhaps a Perceptron run which a)
assumes whatever hit frequency applied to the last full scoring run,
b) freezes all scores in all score sets according to the most recent
distribution, and then c) incorporates an sa-update scoring run and
calculates appropriate scores for the new rules.

If that's not practical, then perhaps we can use some standardized
algorithms to determine provisional scores. The algorithms we use
for general purpose rules within SARE seem to work very well, adding
significantly to spam scores without causing any significant number of
FPs.

Would it be appropriate for me to post those algorithms in the wiki
as part of a "scoring" discussion? I'm thinking this could easily grow
to warrant a page of its own...

Bob Menschel




Re: Streamlining Rules Process

Posted by Daniel Quinlan <qu...@pathname.com>.
Robert Menschel <Ro...@Menschel.net> writes:

> One item not yet mentioned on this page is how to score rules, whether
> they go to core with rapid distribution (such as via sa-update) or to
> the extra rule sets.

Current practice is that new rules temporarily get the default score of
1.0.  The new scoring method makes rescoring much easier, though, and
once we get the kinks worked out, I think we'll be able to rescore much
more often than we have in the past.

One option to bridge the gap would be to score new rules based on a
nightly run, using that run's more limited corpora.  This would
be done by setting old rules' scores to be immutable and only scoring
the new ones.  That would not be too hard and would be more accurate
than any estimation technique.  There is definitely a correlation
between hit rates, S/O ratio, RANK, etc. and the ultimate
perceptron-generated score, but the correlations are not all that high,
unfortunately.
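
The idea of holding old scores immutable and fitting only the new rules
can be illustrated with a minimal sketch.  This is not the SpamAssassin
perceptron tool; the function name score_new_rules, the corpus format,
the learning rate and the epoch count are all assumptions made for the
example.  The starting score of 1.0 follows the default mentioned
above, and the threshold of 5.0 is SpamAssassin's default
required_score.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical corpus format: (names of the rules that hit, is_spam).
Message = Tuple[Set[str], bool]


def score_new_rules(
    corpus: List[Message],
    frozen: Dict[str, float],
    new_rules: List[str],
    threshold: float = 5.0,  # SpamAssassin's default required_score
    rate: float = 0.1,       # assumed learning rate
    epochs: int = 50,        # assumed upper bound on passes
    start: float = 1.0,      # default score for a new rule
) -> Dict[str, float]:
    """Perceptron-style fit that adjusts only the new rules' scores."""
    scores = {name: start for name in new_rules}
    for _ in range(epochs):
        mistakes = 0
        for hits, is_spam in corpus:
            # Old rules contribute their frozen scores; new rules
            # contribute the scores currently being learned.
            total = sum(
                scores[r] if r in scores else frozen.get(r, 0.0)
                for r in hits
            )
            if (total >= threshold) == is_spam:
                continue  # classified correctly, nothing to adjust
            mistakes += 1
            step = rate if is_spam else -rate
            for r in hits:
                if r in scores:  # frozen scores are never modified
                    scores[r] += step
        if mistakes == 0:
            break  # every message in the corpus is classified correctly
    return scores


if __name__ == "__main__":
    frozen = {"OLD_A": 3.0, "OLD_B": 4.5}
    corpus = [
        ({"OLD_A", "NEW_X"}, True),   # spam that the old rules miss
        ({"OLD_B", "NEW_Y"}, False),  # ham pushed over the threshold
        ({"OLD_A"}, False),
    ]
    print(score_new_rules(corpus, frozen, ["NEW_X", "NEW_Y"]))
```

On this toy corpus the score for NEW_X rises above 1.0 and the score
for NEW_Y falls below it, while OLD_A and OLD_B keep their distributed
values.  A real run would also have to handle the separate score sets
and the much larger nightly corpora.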

> The ideal would be to find some way to incorporate new rules into a
> GA/Perceptron-like mechanism, perhaps a Perceptron run which a)
> assumes whatever hit frequency applied to the last full scoring run,
> b) freezes all scores in all score sets according to the most recent
> distribution, and then c) incorporates an sa-update scoring run and
> calculates appropriate scores for the new rules.  [...]

Ah, very good.  I should have read your entire message.  ;-)
 
-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/