Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2005/12/11 02:36:52 UTC

hackathon notes from Sat

Hey,

We've been talking over the "rule promotion" situation and how "sa-update"
will work, and we've agreed that having committers manually cut and paste
rules really won't scale and is too much work.

As a result, here are some notes from a whiteboard session where we
planned out how to fix it so rule-promotion and sa-update work....


SVN TREE LAYOUT:
----------------


    trunk
        -> lib (code, engine)

        -> rules (code-tied ruleset, changes per version)
            - GONE: 50_scores.cf

    rulesrc
        -> core
            - current core ruleset
            - *multiple* scores files
                - taking over from 50_scores.cf
                - can contain "ifversion" sections for specific
                  releases

        -> sandbox

        -> active
            - the new "active set" of rules published for sa-update.

            - when "build/mkrules" runs, these are *not* copied into
              the "rules" directory.


Note that when "build/mkrules" is run, core and sandbox are copied; active is
not.  The active set is purely a *subset* of the core and sandbox sets.
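
To make the copy behaviour concrete, here is a minimal Python sketch of the selection step. It is not the real "build/mkrules" script; the function name build_rules, the "*.cf" glob, and the assumption that file names do not collide between core and sandbox are all hypothetical.

```python
import shutil
from pathlib import Path

# Sets copied into "rules" at build time. "active" is deliberately absent,
# since it only duplicates rules that already live in core or sandbox.
COPIED_SETS = ("core", "sandbox")


def build_rules(rulesrc: Path, dest: Path) -> list[str]:
    """Copy rule files from rulesrc/core and rulesrc/sandbox into dest.

    Returns the copied files as "set/filename" strings. Assumes file names
    are unique across the two sets (an assumption of this sketch).
    """
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for set_name in COPIED_SETS:
        set_dir = rulesrc / set_name
        if not set_dir.is_dir():
            continue  # tolerate a missing set directory
        for cf in sorted(set_dir.glob("*.cf")):
            shutil.copy2(cf, dest / cf.name)
            copied.append(f"{set_name}/{cf.name}")
    return copied
```

Calling build_rules(Path("rulesrc"), Path("rules")) would leave anything under rulesrc/active untouched.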



TASKS IN PROCESS:
-----------------


NIGHTLY TAGGING FOR M-C (CENTRALISED):

input: SVN
output: SVN

    - same as current

MASS-CHECKS (DISTRIBUTED): [multiple users in parallel]

input: SVN
thru: mass-check
output: logs

    - same as current

    - Note: mass-checks do not run with the "active set". They run with all of
      rulesrc/core, and rulesrc/sandbox.  Only the end-user systems running
      sa-update use the limited subset that's found in the "active set".

RULE SELECTION/PROMOTION (CENTRALISED):

input: SVN
input: logs
output: SVN, "active set"

    - use previous day's logs (run at 0800 UTC)

    - TODO?  need an SVN userid to commit results from cron?

    - auto-promotion of "good" rules from sandbox and core.
      Normally all rules are autopromoted, based on how "good" they are.  This
      can be inhibited by setting a tflag, "tflags nopublish".

        "nopublish" allows us to work on rules like T_FORGED_OUTLOOK_TAGS,
        where it's a bug-fix of an existing rule, and it *would* be considered
        immediately promotable.  We need a way to inhibit that, so that it's
        under manual control. 

        The "T_" prefix also implies "nopublish".  The corollary of this is
        that rules in the sandbox no longer have to have a "T_" prefix;
        they now only need that if they're "nopublish".   This helps
        reduce the need to rename rules if they move from sandbox
        to core.

    - Promoted rules are *duplicated* from sandbox and core, into the
      "active set".  This is the set of rules that are published in
      an sa-update update file.

    - "bad" rules in core are deleted.   That means *gone*, though they can
      still be recovered from SVN history.

        Rationale: bad, atrophied rules are pretty much never recoverable in
        our experience!

    - generate a domain-specific language script to perform
      promotions/deletions/etc.

    - Note: SVN trunk, mass-checks, etc. do not run with the "active set". They
      run with all of rulesrc/core, and rulesrc/sandbox.  Only the end-user
      systems running sa-update use the limited subset that's found
      in the "active set".
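
The selection logic above can be sketched roughly as follows. This is only an illustration: the Rule class, the meets_criteria flag (standing in for whatever "good" ends up meaning), and both function names are hypothetical. The only parts taken from the notes are the "T_" prefix and "tflags nopublish" checks.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    name: str
    source: str                      # "sandbox" or "core"
    tflags: set = field(default_factory=set)
    meets_criteria: bool = False     # placeholder for the "good" test


def is_publishable(rule: Rule) -> bool:
    """A rule may be promoted unless it is marked nopublish or named T_*."""
    return not rule.name.startswith("T_") and "nopublish" not in rule.tflags


def select_active_set(rules: list) -> list:
    """Return the names of rules to duplicate into the active set."""
    return sorted(
        r.name for r in rules if is_publishable(r) and r.meets_criteria
    )
```

A bug-fix candidate such as T_FORGED_OUTLOOK_TAGS stays out of the result even when it would otherwise meet the criteria, which is the manual control the notes ask for.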


SCORING (CENTRALISED):

input: SVN
input: logs
thru: perceptron/scoring
output: SVN

    - the logs contain all rules from "core" and "sandbox", but grep out only
      the subset of rules that are in the active set so that the perceptron
      doesn't try to use the others

    - fix Bayes scores (I think this means set them to fixed values, instead
      of letting them "float" and attempting to optimise with perceptron)

    - Daniel says: TODO: fix rewrite-cf-with-new-scores to deal with:
        - automated-generation vs. manual scores in separate files
        - ifplugin blocks inside the scores files
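
As a rough sketch of the "grep out only the active set" step: the code below assumes a simplified log line of four space-separated fields, with the comma-separated rule hits in the fourth. That layout is an assumption made for illustration, not a description of the real mass-check log format, and restrict_to_active is a hypothetical name.

```python
def restrict_to_active(line: str, active: set) -> str:
    """Drop rule hits that are not in the active set from one log line.

    Assumed (simplified) line layout: "<class> <score> <path> <R1,R2,...>".
    Comment lines and lines with fewer than four fields pass through as-is.
    """
    text = line.rstrip("\n")
    fields = text.split(" ")
    if text.startswith("#") or len(fields) < 4:
        return text
    kept = [rule for rule in fields[3].split(",") if rule in active]
    fields[3] = ",".join(kept)
    return " ".join(fields)
```

Filtering before scoring means the perceptron never sees hits for rules that will not be published.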


PACKAGING (CENTRALISED): 

input: SVN, the "active set" only
output: packages

    - TODO:  need a password-less method to sign packages

    - automated test suite for packages before they're published

    - The package will contain both new rules, and rules that were part of
      "core" for the 3.1.0 release.  To avoid the latter conflicting with rules
      in the 3.1.x release, we will produce a 3.1.x point release that deletes
      the ruleset from /usr/share/spamassassin, and immediately runs
      "sa-update"!

    - assume 3.1.x and earlier versions can safely use scores generated
      against "svn trunk" for the "active" set, even though they may
      not be exactly accurate for that release.  (the alternative is
      running a full mass-check for all releases -- too much!)



RULE STATES:
------------

These are the states that rules pass through.


    Rules in sandbox:

        - experimental -- don't promote me.  "T_" prefix implies this.
          "tflags nopublish" ditto.

        - s_poor -- promotable, but not meeting promotion criteria.

        - s_good -- promotable, and meeting criteria.  Rules in this
          state are copied into the "active set".

    Rules in core:

        - c_poor -- promotable, but not meeting promotion criteria.

        - c_good -- promotable, and meeting criteria.  Rules in this state are
          copied into the "active set".

    Deleted rules:

        - gone -- rule has been deleted.   If a rule is in c_poor for "an
          extended period of time", it goes here.


So the permitted transitions are:

        - experimental <---> s_poor
        - experimental <---> s_good
        - s_poor <---> s_good
        - c_poor <---> c_good
        - c_poor -> gone
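
The transition list above is small enough to encode as a lookup table. A minimal sketch, with hypothetical names (can_transition and the constants are not from the notes):

```python
# Pairs connected by "<--->" in the notes: allowed in both directions.
BIDIRECTIONAL = [
    ("experimental", "s_poor"),
    ("experimental", "s_good"),
    ("s_poor", "s_good"),
    ("c_poor", "c_good"),
]

# Pairs connected by "->": allowed in one direction only.
ONE_WAY = [("c_poor", "gone")]

ALLOWED = set(ONE_WAY)
for a, b in BIDIRECTIONAL:
    ALLOWED.add((a, b))
    ALLOWED.add((b, a))


def can_transition(old: str, new: str) -> bool:
    """True if the notes list old -> new as a permitted transition."""
    return (old, new) in ALLOWED
```

Since the list has no sandbox-to-core entry, this sketch rejects a move such as s_good to c_good; if such moves are meant to be allowed they would need to be added.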




Re: hackathon notes from Sat

Posted by Warren Togami <wt...@redhat.com>.
Duncan Findlay wrote:
> On Wed, Dec 14, 2005 at 11:36:11AM -0800, Justin Mason wrote:
>> Duncan Findlay writes:
> 
>>> Right. I also don't see any need to split the rules out of the main
>>> package -- spamassassin just needs to be smart enough to use the right
>>> set of rules -- either where sa-update drops them or where they are
>>> installed by default.
>> So you're suggesting we'd have:
>>
>>     /usr/share/spamassassin/72_active.cf: base, released copy of
>>          rule updates
>>     /etc/mail/spamassassin/sa_update.cf: override of that default set
>>
>> ??
> 
> Yes, except that I'd argue /etc/ isn't the right place for it
> either. I'm really thinking it should go in /var/lib somewhere. But
> that would mean we'd have the following:
>  
>  /etc/spamassassin | /etc/mail/spamassassin	- site config
>  /usr/share/spamassassin | ...			- default rules
>  /var/lib/spamassassin 				- sa-update drop directory

Very strong ++ here.

Warren Togami
wtogami@redhat.com

Re: hackathon notes from Sat

Posted by Duncan Findlay <du...@debian.org>.
On Wed, Dec 14, 2005 at 11:36:11AM -0800, Justin Mason wrote:
> Duncan Findlay writes:

> >Right. I also don't see any need to split the rules out of the main
> >package -- spamassassin just needs to be smart enough to use the right
> >set of rules -- either where sa-update drops them or where they are
> >installed by default.
> 
> So you're suggesting we'd have:
> 
>     /usr/share/spamassassin/72_active.cf: base, released copy of
>          rule updates
>     /etc/mail/spamassassin/sa_update.cf: override of that default set
> 
> ??

Yes, except that I'd argue /etc/ isn't the right place for it
either. I'm really thinking it should go in /var/lib somewhere. But
that would mean we'd have the following:
 
 /etc/spamassassin | /etc/mail/spamassassin	- site config
 /usr/share/spamassassin | ...			- default rules
 /var/lib/spamassassin 				- sa-update drop directory
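
To illustrate the lookup this layout implies, here is a small Python sketch. It is not how Mail::SpamAssassin works today; pick_rules_dir and its parameters are hypothetical, and only the two paths come from the list above.

```python
import os
from typing import Callable

UPDATE_DIR = "/var/lib/spamassassin"     # sa-update drop directory
DEFAULT_DIR = "/usr/share/spamassassin"  # rules installed by the package


def pick_rules_dir(
    update_dir: str = UPDATE_DIR,
    default_dir: str = DEFAULT_DIR,
    is_dir: Callable[[str], bool] = os.path.isdir,
) -> str:
    """Prefer the sa-update drop directory when it exists.

    The packaged rules are never modified; they are only the fallback.
    """
    return update_dir if is_dir(update_dir) else default_dir
```

The is_dir parameter exists only so the choice can be exercised without touching the real filesystem.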

> I could go for that.  We'd have to modify the Mail::SpamAssassin code
> to recognise the 72_active.cf file somehow and allow it to be ignored
> in the system rules dir, if it appears in the site rules dir.

Are we going to be consolidating all the rules to one file? It would
make it tougher for users to read and play with, if that's a concern.

-- 
Duncan Findlay

Re: hackathon notes from Sat

Posted by Duncan Findlay <du...@debian.org>.
On Tue, Dec 13, 2005 at 03:49:44PM -0500, Warren Togami wrote:
> Duncan Findlay wrote:
> >The only problem I see with the above, is that no script should be
> >overwriting rules that are distributed in a package. So if I
> >distribute a spamassassin-rules .deb, which would stick files in
> >/usr/share/spamassassin, no script should go in and overwrite those
> >rules. sa-update should be writing to somewhere in
> >/var/lib/spamassassin (or /var/cache/spamassassin ?) and
> >spamassassin/spamd should be reading from that location if it exists.
> >
> >So, looks like spamassassin/spamd probably needs to be modified to
> >read from /var/lib/spamassassin if we want sa-update to work this way.
> >
> 
> I am in agreement that sa-update should download rules/scores into 
> somewhere in /var, and it shouldn't overwrite files distributed by the 
> package.  I am not so sure I like the separate co-dependent package for 
> scores thing as a requirement.

Right. I also don't see any need to split the rules out of the main
package -- spamassassin just needs to be smart enough to use the right
set of rules -- either where sa-update drops them or where they are
installed by default.

> I am a little confused about the terminology, active-set means network 
> tests right?

I believe "active-set" refers to the latest scored set of rules -- the
idea being that rules will be updated more often than code.

-- 
Duncan Findlay

Re: hackathon notes from Sat

Posted by Warren Togami <wt...@redhat.com>.
Duncan Findlay wrote:
> The only problem I see with the above, is that no script should be
> overwriting rules that are distributed in a package. So if I
> distribute a spamassassin-rules .deb, which would stick files in
> /usr/share/spamassassin, no script should go in and overwrite those
> rules. sa-update should be writing to somewhere in
> /var/lib/spamassassin (or /var/cache/spamassassin ?) and
> spamassassin/spamd should be reading from that location if it exists.
> 
> So, looks like spamassassin/spamd probably needs to be modified to
> read from /var/lib/spamassassin if we want sa-update to work this way.
> 

I am in agreement that sa-update should download rules/scores into 
somewhere in /var, and it shouldn't overwrite files distributed by the 
package.  I am not so sure I like the separate co-dependent package for 
scores thing as a requirement.

I am a little confused about the terminology, active-set means network 
tests right?

Warren Togami
wtogami@redhat.com

Re: hackathon notes from Sat

Posted by Duncan Findlay <du...@debian.org>.
On Sun, Dec 11, 2005 at 12:35:46PM -0800, Justin Mason wrote:
> OK, we're rethinking this; it no longer seems necessary for it
> to be a requirement, and you have good points there.
> 
> What about this?
> 
>   - basic "spamassassin" package (rpm/deb) contains no active-set rules
> 
>   - there's another package which contains the active-set rules, in the
>     location where "sa-update" can later overwrite them
> 
>   - both packages co-depend on each other.
> 
> The second package can be updated either via distro packaging methods --
> apt-get/yum, or can be overwritten using "sa-update".

Yeah, sorry I didn't read the original message carefully enough. I
think I'm pretty much in agreement with Warren though as far as
requirements go.

The only problem I see with the above, is that no script should be
overwriting rules that are distributed in a package. So if I
distribute a spamassassin-rules .deb, which would stick files in
/usr/share/spamassassin, no script should go in and overwrite those
rules. sa-update should be writing to somewhere in
/var/lib/spamassassin (or /var/cache/spamassassin ?) and
spamassassin/spamd should be reading from that location if it exists.

So, looks like spamassassin/spamd probably needs to be modified to
read from /var/lib/spamassassin if we want sa-update to work this way.

-- 
Duncan Findlay

Re: hackathon notes from Sat

Posted by Warren Togami <wt...@redhat.com>.
Justin Mason wrote:
> PACKAGING (CENTRALISED): 
> 
> input: SVN, the "active set" only
> output: packages
> 
>     - TODO:  need a password-less method to sign packages
> 
>     - automated test suite for packages before they're published
> 
>     - The package will contain both new rules, and rules that were part of
>       "core" for the 3.1.0 release.  To avoid the latter conflicting with rules
>       in the 3.1.x release, we will produce a 3.1.x point release that deletes
>       the ruleset from /usr/share/spamassassin, and immediately runs
>       "sa-update"!
> 

Could you please clarify what this means?  Fedora places some general 
restrictions on any package we ship.  I don't know much about the 
proposed implementation, but as worded, this paragraph may be 
incompatible with them.  Specifically, we could not do any of the 
following:

1) Download scores during buildtime
For security reasons build systems should rely only on local sources and 
not rely on the network.  The build payload is also not reproducible if 
it relies on network inputs.

2) Download scores upon package install
We cannot assume that users have networking during package installation.

3) Automatic sa-update by default
We cannot ship a package that makes outgoing network calls unless the 
sysadmin has explicitly enabled them.  For the same reason, our spamd 
service is not started by default, and our evolution default config uses 
only local tests when it uses spamassassin.  Network querying is allowed 
only after the sysadmin explicitly enables the spamassassin service or 
modifies evolution's configuration.

We would need to ship Fedora/RHEL's spamassassin with a default set of 
scores shipped in our package for payload reproducibility.  It is up to 
the system's user whether they want to run sa-update or not.  Note that 
this does not mean that the scores we ship need be computed at the time 
of a release.  Our package updates could contain a newer set.

Is there any plan for exactly how sa-update will be run periodically? 
In order to avoid overloading the data source, it should run at random 
intervals.

http://cvs.fedora.redhat.com/viewcvs/devel/clamav/?root=extras
Fedora Extras clamav package has an ugly but effective example of 
randomized interval updating.  Perhaps the sysadmin could activate a 
separate sa-update daemon, or sa-update could be run periodically by 
spamd itself?  Just some ideas...
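
As a rough sketch of the random-interval idea (not the clamav package's mechanism, and not anything sa-update does itself): pick a base interval and add a random offset so that clients do not all hit the data source at once. The function name and both parameters are hypothetical.

```python
import random
from typing import Optional


def next_update_delay(
    base_seconds: float,
    jitter_seconds: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Return how long to wait before the next update attempt.

    The result is base_seconds plus a uniformly random offset between 0 and
    jitter_seconds, which spreads requests from many clients over time.
    """
    rng = rng or random.Random()
    return base_seconds + rng.uniform(0, jitter_seconds)
```

For example, next_update_delay(24 * 3600, 3600) gives a wait of between 24 and 25 hours; a scheduler would sleep that long and then run the update.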

Warren Togami
wtogami@redhat.com