Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2005/12/11 20:27:12 UTC

Re: hackathon notes from Sat

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Justin Mason writes:
>Hey,
>
>so we're talking over the "rule promotion" situation, and how "sa-update" will
>work, and we've come to an agreement that having committers manually cut and
>paste rules really won't scale, and is too much work.
>
>As a result, here's some notes from a whiteboard session where we're
>planning out how to fix it so rule-promotion and sa-update work....

These were written up very quickly.  I should expand them a bit to make
them more comprehensible. ;)

So, the idea is basically that we define a new "view" into the ruleset,
the "active set".  This is composed of the rules that a sysadmin would run
in a "live" production system [*], and therefore what is shipped out
through sa-update and in the basic release tarballs.

([*]: well, a production system would run this set of rules, plus the
"code-tied" rules like BAYES_*, etc.)

The code-tied rules in "rules" are not updateable via sa-update.  Their
scores _are_, however, so they can be disabled (and then replaced) by
setting their scores to 0.  (In some cases, we may want to move more of
these into "rulesrc", protected by ifversion lines.)

Since it's important that we be able to update the scores frequently with
sa-update, the 50_scores.cf file also moves out of "rules" and into the
rulesrc tree.  Daniel made a good point -- it'll be easier to
twiddle this programmatically if we split it into multiple files, so
that'll happen too.  Multiple files also allow us to maintain separate
scores files for different releases.

With the addition of the active set, we now have two "modes" for rules:

    1. development mode: when you're testing rules or running mass-checks
    (including the nightly/weekly ones); this uses the entire "core" and
    "sandbox" rulesets.

    2. production mode: just running with the "active set".  This is what
    a deployed, live server in production use would use, what sa-update
    sends out, and what "make dist" will package.

The active set is produced entirely from "core" and "sandbox", and is
therefore a derivative product of those.  They are the source.


Rules, again, can live in the sandboxes, and be published directly from
there.  In other words, you can throw a rule into your sandbox,
and the next day it might show up in the "active set" being published
via sa-update.

Sometimes, this isn't desirable; so "tflags RULENAME nopublish" (or
something like that) will inhibit this.
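
To make the opt-out concrete, here's a rough sketch (in Python, purely
for illustration) of how a promotion script could work out which rules
are eligible.  Assumptions: "nopublish" is only the flag name proposed
above, the DEFINING list is a partial guess at the rule-defining
directives, and publishable_rules() is a made-up helper name.

```python
# Directives assumed to introduce a rule name as their second token.
DEFINING = {"header", "body", "rawbody", "full", "uri", "meta"}

def publishable_rules(lines):
    """Return the rule names defined in `lines` that are not held back
    by the proposed "tflags RULENAME nopublish" marker."""
    defined, blocked = set(), set()
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # blank lines, comments, anything too short
        if parts[0] in DEFINING:
            defined.add(parts[1])
        elif parts[0] == "tflags" and "nopublish" in parts[2:]:
            blocked.add(parts[1])
    return defined - blocked

sandbox = [
    "body     NEW_RULE   /cheap meds/i",
    "body     HELD_RULE  /work in progress/",
    "tflags   HELD_RULE  nopublish",
]
print(sorted(publishable_rules(sandbox)))
```

Here HELD_RULE stays out of the candidate list until its tflags line is
removed; NEW_RULE is eligible straight away.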


Now, we were considering actually copying and deleting lines from
the source files, but every time I think about this, I come up with
more bad feelings about it.  So instead, I think this will work:

We have:

    - the "rulesrc/core" and "rulesrc/sandbox" source directories,
      where the source code for the rules lives

    - the scores files: "rulesrc/core/50_scores*.cf"?  or a dir,
      "rulesrc/core/scores/*.cf"?  haven't decided this

    - a manifest file containing the names of rules that are publishable
      to the active set: "rules/active.list"?

    - a script which reads "rulesrc/core", "rulesrc/sandbox" and
      "rules/active.list", selects the active rules and scores, and
      outputs their lines to the active set output file.

    - the active set output, a single file: "rules/72_active.cf".
      (this could be multiple files, too, but let's start with
      a single one.)


That should be pretty simple to implement!
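
Roughly, the selection step could look like the sketch below.  It's
Python just for illustration and works on lists of lines rather than on
the real files.  Assumptions: every relevant line has the shape
"<directive> RULENAME ...", the RULE_DIRECTIVES list is partial, and
build_active_set() is a hypothetical name, not anything in build/.

```python
# Directives whose second token is assumed to be a rule name.
RULE_DIRECTIVES = {"header", "body", "rawbody", "full", "uri", "meta",
                   "describe", "tflags", "score"}

def rule_name(line):
    """Return the rule a config line refers to, or None."""
    parts = line.split()
    if len(parts) >= 2 and parts[0] in RULE_DIRECTIVES:
        return parts[1]
    return None

def build_active_set(source_lines, active_names):
    """Keep only the lines belonging to rules named in the manifest.

    source_lines: lines gathered from core, sandbox and the scores files
    active_names: rule names read from the manifest (e.g. active.list)
    """
    wanted = set(active_names)
    return [line for line in source_lines if rule_name(line) in wanted]

source = [
    "# sandbox rules",
    "body     GOOD_RULE  /buy now/i",
    "describe GOOD_RULE  Example rule",
    "body     OTHER_RULE /something else/",
    "score    GOOD_RULE  1.5",
    "score    OTHER_RULE 0.1",
]
active = build_active_set(source, ["GOOD_RULE"])
print("\n".join(active))
```

Because the source files are only read, never edited, this avoids the
copy-and-delete approach entirely; the output file can be regenerated
at any time from the sources plus the manifest.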


OK, we now need a way for "make install" and "make dist" to tell the
development ruleset (used for mass-checks etc.) from the active set
(used in deployment, production use).

I suggest that we use the ordering numbers in the filenames in "rules" to
do this.

    - {00-69,71,73-99}_*.cf: always copied to both sets

    - 70_*.cf: development set.  It already has this meaning, so this is
      backward-compatible!

    - 72_*.cf: active set.  This is an arbitrary number, picked to be
      unlikely to collide with an existing number.

By doing that, we can now tell "active set" and dev set rules
files apart in the "rules" directory, very useful for the
build and packaging scripts.
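
A sketch of how the build scripts could apply that numbering, in Python
for illustration only.  classify() and files_for() are hypothetical
helper names; the number ranges are the ones listed above.

```python
import re

def classify(filename):
    """Map a rules filename to "development", "active" or "both".

    Returns None for files that don't look like NN_name.cf.
    """
    m = re.fullmatch(r"(\d\d)_.*\.cf", filename)
    if m is None:
        return None
    prefix = m.group(1)
    if prefix == "70":
        return "development"
    if prefix == "72":
        return "active"
    return "both"  # 00-69, 71, 73-99

def files_for(mode, filenames):
    """Select the files that belong in the given set."""
    return [f for f in filenames if classify(f) in ("both", mode)]

names = ["10_misc.cf", "70_sandbox.cf", "72_active.cf", "README"]
print(files_for("active", names))
print(files_for("development", names))
```

So "make dist" would ask for the "active" files and a mass-check
checkout would ask for the "development" ones.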


The stuff currently in build/mkrules to allow output to multiple
filenames needs to be removed again.  It makes no sense under
this scheme and adds too much complexity.

- --j.


>SVN TREE LAYOUT:
>----------------
>
>
>    trunk
>        -> lib (code, engine)
>
>        -> rules (code-tied ruleset, changes per version)
>            - GONE: 50_scores.cf
>
>    rulesrc
>        -> core
>            - current core ruleset
>            - *multiple* scores files
>                - taking over from 50_scores.cf
>                - can contain "ifversion" sections for specific
>                  releases
>
>        -> sandbox
>
>        -> active
>            - the new "active set" of rules published for sa-update.
>
>            - when "build/mkrules" runs, these are *not* copied into
>              the "rules" directory.
>
>
>Note that when "build/mkrules" is run, core and sandbox are copied, active is
>not.  active is purely a *subset* of the core and sandbox sets.
>
>
>
>TASKS IN PROCESS:
>-----------------
>
>
>NIGHTLY TAGGING FOR M-C (CENTRALISED):
>
>input: SVN
>output: SVN
>
>    - same as current
>
>MASS-CHECKS (DISTRIBUTED): [multiple users in parallel]
>
>input: SVN
>thru: mass-check
>output: logs
>
>    - same as current
>
>    - Note: mass-checks do not run with the "active set". They run with all of
>      rulesrc/core, and rulesrc/sandbox.  Only the end-user systems running
>      sa-update use the limited subset that's found in the "active set".
>
>RULE SELECTION/PROMOTION (CENTRALISED):
>
>input: SVN
>input: logs
>output: SVN, "active set"
>
>    - use previous day's logs (run at 0800 UTC)
>
>    - TODO?  need an SVN userid to commit results from cron?
>
>    - auto-promotion of "good" rules from sandbox and core.  Normally all
>      rules are autopromoted, based on how "good" they are.  This can be
>      inhibited by setting a tflag, "tflags nopublish".
>
>        "nopublish" allows us to work on rules like T_FORGED_OUTLOOK_TAGS,
>        where it's a bug-fix of an existing rule, and it *would* be considered
>        immediately promotable.  We need a way to inhibit that, so that it's
>        under manual control. 
>
>        Also, the "T_" prefix implies "nopublish".  The corollary of this is
>        that rules in the sandbox no longer have to have a "T_" prefix;
>        they now only need that if they're "nopublish".   This helps
>        reduce the need to rename rules if they move from sandbox
>        to core.
>
>    - Promoted rules are *duplicated* from sandbox and core, into the
>      "active set".  This is the set of rules that are published in
>      an sa-update update file.
>
>    - "bad" rules in core are deleted.  That means *gone*, though they can
>      still be recovered from SVN history.
>
>        Rationale: bad, atrophied rules are pretty much never recoverable in
>        our experience!
>
>    - generate a domain-specific language script to perform
>      promotions/deletions/etc.
>
>    - Note: SVN trunk, mass-checks, etc. do not run with the "active set". They
>      run with all of rulesrc/core, and rulesrc/sandbox.  Only the end-user
>      systems running sa-update use the limited subset that's found
>      in the "active set".
>
>
>SCORING (CENTRALISED):
>
>input: SVN
>input: logs
>thru: perceptron/scoring
>output: SVN
>
>    - the logs contain all rules from "core" and "sandbox", but grep out only
>      the subset of rules that are in the active set so that the perceptron
>      doesn't try to use the others
>
>    - fix Bayes scores (I think this means set them to fixed values, instead
>      of letting them "float" and attempting to optimise with perceptron)
>
>    - Daniel says: TODO: fix rewrite-cf-with-new-scores to deal with:
>        - automated-generation vs. manual scores in separate files
>        - ifplugin blocks inside the scores files
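
The "grep out" step could be as simple as the sketch below (Python, for
illustration).  Big assumption: each log line is whitespace-separated
with a comma-separated list of rule hits in the fourth field; check
that against the real mass-check output before relying on it.
active_hits() is a made-up name.

```python
def active_hits(log_line, active, field=3):
    """Return the rule hits on one log line that are in the active set.

    field is the (assumed) zero-based index of the comma-separated
    list of rule names.
    """
    parts = log_line.split()
    if log_line.startswith("#") or len(parts) <= field:
        return []  # comment or malformed line: nothing to score
    return [name for name in parts[field].split(",") if name in active]

active = {"GOOD_RULE", "BAYES_99"}
line = "Y 7 /corpus/spam/1 GOOD_RULE,T_DEV_RULE,BAYES_99 time=1"
print(active_hits(line, active))
```

The perceptron then only ever sees hits for rules it is allowed to
assign scores to.
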
>
>
>PACKAGING (CENTRALISED): 
>
>input: SVN, the "active set" only
>output: packages
>
>    - TODO:  need a password-less method to sign packages
>
>    - automated test suite for packages before they're published
>
>    - The package will contain both new rules, and rules that were part of
>      "core" for the 3.1.0 release.  To avoid the latter conflicting with rules
>      in the 3.1.x release, we will produce a 3.1.x point release that deletes
>      the ruleset from /usr/share/spamassassin, and immediately runs
>      "sa-update"!
>
>    - assume 3.1.x and earlier versions can safely use scores generated
>      against "svn trunk" for the "active" set, even though they may
>      not be exactly accurate for that release.  (the alternative is
>      running a full mass-check for all releases -- too much!)
>
>
>
>RULE STATES:
>------------
>
>These are the states that rules pass through.
>
>
>    Rules in sandbox:
>
>        - experimental -- don't promote me.  "T_" prefix implies this.
>          "tflags nopublish" ditto.
>
>        - s_poor -- promotable, but not meeting promotion criteria.
>
>        - s_good -- promotable, and meeting criteria.  Rules in this
>          state are copied into the "active set".
>
>    Rules in core:
>
>        - c_poor -- promotable, but not meeting promotion criteria.
>
>        - c_good -- promotable, and meeting criteria.  Rules in this state are
>          copied into the "active set".
>
>    Deleted rules:
>
>        - gone -- rule has been deleted.   If a rule is in c_poor for "an
>          extended period of time", it goes here.
>
>
>So the permitted transitions are:
>
>        - experimental <---> s_poor
>        - experimental <---> s_good
>        - s_poor <---> s_good
>        - c_poor <---> c_good
>        - c_poor -> gone
>
>
>
>
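
The transition list above is small enough to write down as a lookup
table.  A Python sketch, for illustration only: the state names are the
ones from the notes, and can_transition() is a hypothetical helper.

```python
# Pairs that may move in either direction.
TWO_WAY = [
    ("experimental", "s_poor"),
    ("experimental", "s_good"),
    ("s_poor", "s_good"),
    ("c_poor", "c_good"),
]
# Pairs that may only move left to right.
ONE_WAY = [("c_poor", "gone")]

ALLOWED = set(ONE_WAY)
for a, b in TWO_WAY:
    ALLOWED.add((a, b))
    ALLOWED.add((b, a))

def can_transition(old, new):
    """True if a rule may move directly from state `old` to `new`."""
    return (old, new) in ALLOWED

print(can_transition("c_poor", "gone"))
print(can_transition("gone", "c_poor"))
```

A promotion script could run a check like this before committing
anything, so that a rule never jumps straight from c_good to gone.
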
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFDnH2QMJF5cimLx9ARAvRkAJ0a56wbJrgBPqMorSL7J72Yvd6fMQCgonl5
9qNeQ6NA6RCPVqG7jE2Bok8=
=X8i/
-----END PGP SIGNATURE-----